Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
Date: 2024-02-05 10:13:41
Message-ID: CALDaNm16SW3X2snvTKu-YQxGWGhjWLWj5H9eqXc-bTK4inUtJw@mail.gmail.com
Lists: pgsql-hackers

On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> Hello Vignesh and Hou-san,
>
> 01.02.2024 07:59, vignesh C wrote:
> > Here is an updated patch which changes the boolean variable to a
> > tri-state enum and sets the table state to valid only if no invalidations
> > have occurred while the list is being prepared.
> >
>
> While testing the v3 patch, I observed another anomaly with the 014_binary
> test. When I run the test in parallel, I see that sometimes one of the test
> instances runs much longer than others. For example:
> 49 All tests successful.
> 49 Files=1, Tests=8, 4 wallclock secs ( 0.03 usr 0.01 sys + 0.43 cusr 0.30 csys = 0.77 CPU)
> 49 Result: PASS
> 12 All tests successful.
> 12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr 0.01 sys + 0.46 cusr 0.40 csys = 0.89 CPU)
> 12 Result: PASS
>
> As far as I can see, this happens due to another race condition, this time
> in launcher.c.
> For such a three-minute case I see in _subscriber.log:
> 2024-02-01 14:33:13.604 UTC [949255] DEBUG: relation "public.test_mismatching_types" does not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message
> type "INSERT" in transaction 757, finished at 0/153C838
> 2024-02-01 14:33:13.604 UTC [949255] ERROR: logical replication target relation "public.test_mismatching_types" does
> not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message
> type "INSERT" in transaction 757, finished at 0/153C838
> ...
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG: statement: CREATE TABLE public.test_mismatching_types (
> a int PRIMARY KEY
> );
> 2024-02-01 14:33:13.605 UTC [942451] DEBUG: unregistering background worker "logical replication apply worker for
> subscription 16398"
> 2024-02-01 14:33:13.605 UTC [942451] LOG: background worker "logical replication apply worker" (PID 949255) exited with
> exit code 1
> ...
> 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> ...
> 2024-02-01 14:36:13.642 UTC [942527] DEBUG: starting logical replication worker for subscription "tsub"
> (there is no interesting activity between 14:33:13 and 14:36:13)
>
> So the logical replication apply worker exited because CREATE TABLE had not
> been executed on the subscriber yet, and a new replication worker started
> only after a timeout occurred in WaitLatch(..., wait_time, ...) inside
> ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE,
> i.e. 180 sec).
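>
> For reference, the launcher-side wait boils down to roughly the following
> (a simplified sketch, not the exact ApplyLauncherMain() code):
>
> for (;;)
> {
>     long wait_time = DEFAULT_NAPTIME_PER_CYCLE;    /* 180000 ms */
>
>     /* ... start apply workers for subscriptions that need one ... */
>
>     /* sleep until our latch is set or the nap time elapses */
>     rc = WaitLatch(MyLatch,
>                    WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                    wait_time,
>                    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
>
>     if (rc & WL_LATCH_SET)
>     {
>         ResetLatch(MyLatch);
>         CHECK_FOR_INTERRUPTS();
>     }
> }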
>
> But in a normal (fast) case the same WaitLatch exits due to MyLatch being set
> as a result of:
> logical replication apply worker | logicalrep_worker_onexit() ->
> ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
> launcher | procsignal_sigusr1_handler() -> SetLatch(MyLatch).
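>
> In code terms, the happy path is roughly this (again simplified, not the
> exact worker.c/launcher.c source):
>
> /* apply worker, in its on-exit callback: */
> static void
> logicalrep_worker_onexit(int code, Datum arg)
> {
>     ...
>     ApplyLauncherWakeup();   /* kill(LogicalRepCtx->launcher_pid, SIGUSR1) */
> }
>
> /* launcher, at the end of procsignal_sigusr1_handler(): */
> SetLatch(MyLatch);           /* should wake the WaitLatch() in ApplyLauncherMain() */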
>
> In a bad case, I see that SetLatch() is called as well, but then the latch
> gets reset by the following code in WaitForReplicationWorkerAttach():
> rc = WaitLatch(MyLatch,
>                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                10L, WAIT_EVENT_BGWORKER_STARTUP);
>
> if (rc & WL_LATCH_SET)
> {
>     ResetLatch(MyLatch);
>     CHECK_FOR_INTERRUPTS();
> }
>
> With pg_usleep(300000); added just before the ResetLatch call, and with
> $node_subscriber->safe_psql(
> 'postgres', qq(
> +SELECT pg_sleep(0.5);
> CREATE TABLE public.test_mismatching_types (
> changed this way in 014_binary.pl, I can see the anomaly without running
> tests in parallel, just by running that test in a loop:
> for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
> ...
> iteration 2
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8, 5 wallclock secs ( 0.00 usr 0.00 sys + 0.24 cusr 0.18 csys = 0.42 CPU)
> Result: PASS
> iteration 3
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8, 183 wallclock secs ( 0.02 usr 0.00 sys + 0.28 cusr 0.25 csys = 0.55 CPU)
> Result: PASS
> ...
>
> In other words, the abnormal test execution takes place due to the
> following sequence of events:
> 1. the logical replication launcher launches a replication worker and waits
> for it to attach:
> ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
> 2. the worker exits due to some error (logical replication target
> relation does not exist, in our case) and sends a signal to the launcher to
> set its latch
> 3. the launcher sets the latch in procsignal_sigusr1_handler(), but then
> resets it inside WaitForReplicationWorkerAttach()
> 4. the launcher gets back to ApplyLauncherMain(), where it waits for the latch
> (which is no longer set) or for a timeout (which happens only after
> DEFAULT_NAPTIME_PER_CYCLE ms), as annotated below.
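>
> Annotating the fragment quoted above, the wakeup is swallowed at this point
> (comments mine):
>
> if (rc & WL_LATCH_SET)
> {
>     /* the SetLatch() done on behalf of the exiting worker (steps 2-3)
>      * is consumed and cleared right here ... */
>     ResetLatch(MyLatch);
>     CHECK_FOR_INTERRUPTS();
> }
> /* ... so when control returns to ApplyLauncherMain() (step 4), its own
>  * WaitLatch() finds the latch unset and sleeps for the full
>  * DEFAULT_NAPTIME_PER_CYCLE. */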
>
> Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
> test 027_nosuperuser executes for 3+ minutes on each run for me.

Thanks for the steps to reproduce the issue; I was able to reproduce it in my
environment with the steps provided. The attached patch has a proposed fix in
which the latch will not be set if the apply worker exits immediately after
starting.

Regards,
Vignesh

Attachment Content-Type Size
v4-0001-Table-sync-missed-for-newly-added-tables-because-.patch text/x-patch 3.3 KB
v4-0002-Apply-worker-will-get-started-after-180-seconds-b.patch text/x-patch 1.5 KB
