Re: Logrep launcher race conditions leading to slow tests

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Logrep launcher race conditions leading to slow tests
Date: 2025-06-24 10:20:37
Message-ID: CAA4eK1+yLAhA6Pxe1V-bEnDOx7=oR+rinK3mf_DoH3BgbM+jXw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 24, 2025 at 5:26 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I've been annoyed for awhile because, while a parallel check-world
> run usually takes a bit over a minute on my machine, sometimes it
> takes between three and four minutes. I was finally able to
> track down what is happening, and it's this: sometimes one or
> another of the src/test/subscription tests takes an extra three
> minutes because the logical replication launcher is sleeping
> instead of launching the next task. It eventually reaches its
> hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
> wakes up and notices it has something to do, and then we're
> on our merry way again. I'm not sure how often this is a problem
> in the real world, but it happens often enough to be annoying
> during development.
>
> There are two distinct bugs involved:
>
> 1. WaitForReplicationWorkerAttach sometimes has to clear a process
> latch event so that it can keep waiting for the worker to launch.
> It neglects to set the latch again, allowing ApplyLauncherMain
> to miss events.
>

There was a previous discussion about fixing this behavior. Heikki
proposed a similar fix, but at the caller; see the patch attached in
email [1]. The comments in his patch are more explicit about the kinds
of events that can be missed. There was also a discussion of solving it
by using two latches, one for worker exit and the other for
subscription changes. The two-latch approach has a slight advantage in
that it avoids extra looping over the subscriptions when nothing has
changed. I find either your proposal or Heikki's good enough to fix
this issue.
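
To make it concrete, the single-latch style of fix amounts to something
roughly like the following in WaitForReplicationWorkerAttach() (a
simplified sketch only, not Heikki's or your actual patch; the
attach/exit checks are elided):

    bool        dropped_latch = false;

    for (;;)
    {
        int         rc;

        /*
         * ... break out once the worker has attached, or give up if it
         * has already exited ...
         */

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /* Clear the latch so that we can keep waiting ... */
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
            dropped_latch = true;
        }
    }

    /*
     * ... but if we absorbed a latch event above, set it again before
     * returning, so that ApplyLauncherMain (or the tablesync caller)
     * does not sleep through a subscription change or a worker-exit
     * notification.
     */
    if (dropped_latch)
        SetLatch(MyLatch);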

> 2. ApplyLauncherMain ignores a failure return from
> logicalrep_worker_launch, which is bad because (unless it has
> another worker launch pending) it will then sleep for
> DEFAULT_NAPTIME_PER_CYCLE before reconsidering. What it ought to do
> is try again after wal_retrieve_retry_interval. This situation can
> arise in resource-exhaustion cases (cf. the early exits in
> logicalrep_worker_launch), but what's more likely in the regression
> tests is that the worker stops due to some error condition before
> WaitForReplicationWorkerAttach sees it attached, which is then duly
> reported as a failure.
>

LGTM.
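
For clarity, I understand the fix for (2) to be roughly the following
in ApplyLauncherMain()'s per-subscription loop (only a sketch; the
exact logicalrep_worker_launch() argument list differs across
branches):

    long        wait_time = DEFAULT_NAPTIME_PER_CYCLE;

    /* ... for each subscription that needs an apply worker ... */
    if (!logicalrep_worker_launch(WORKERTYPE_APPLY,
                                  sub->dbid, sub->oid, sub->name,
                                  sub->owner, InvalidOid,
                                  DSM_HANDLE_INVALID))
    {
        /*
         * The launch failed (resource exhaustion, or the worker exited
         * before attaching).  Retry after wal_retrieve_retry_interval
         * rather than sleeping for the full naptime.
         */
        wait_time = Min(wait_time, wal_retrieve_retry_interval);
    }

    /* ... and later, sleep only for the computed wait_time ... */
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     wait_time,
                     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);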

> It's possible to make the test slowness extremely reproducible
> with this change, which widens the race condition window for
> both problems:
>
> diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
> index 1c3c051403d..724e82bcdc1 100644
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
>           */
>          rc = WaitLatch(MyLatch,
>                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> -                       10L, WAIT_EVENT_BGWORKER_STARTUP);
> +                       1000L, WAIT_EVENT_BGWORKER_STARTUP);
>
>          if (rc & WL_LATCH_SET)
>          {
>
> I don't recommend that as a permanent change, but it's helpful
> for testing the attached patch.
>
> In the attached, I made two other non-cosmetic changes:
>
> 3. In WaitForReplicationWorkerAttach, capture worker->in_use
> before not after releasing LogicalRepWorkerLock. Maybe there
> is a reason why that's not a dangerous race condition, but
> it sure is un-obvious to me.
>

I also can't think of a reason for the current coding. Your change
looks good to me.
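
That is, something along these lines (simplified):

    bool        worker_in_use;

    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
    /* Capture what we need from the shared slot while holding the lock. */
    worker_in_use = worker->in_use;
    LWLockRelease(LogicalRepWorkerLock);

    if (!worker_in_use)
        return false;           /* slot was freed or reused meanwhile */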

> 4. In process_syncing_tables_for_apply (the other caller of
> logicalrep_worker_launch), it seems okay to ignore the
> result of logicalrep_worker_launch, but I think it should
> fill hentry->last_start_time before not after the call.

With this change, won't we end up retrying the launch sooner in cases
where the launch attempt took time but still failed to start the
worker?
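
To be concrete, the ordering you are proposing would be roughly (a
sketch; the logicalrep_worker_launch() argument list is as in recent
branches):

    TimestampTz now = GetCurrentTimestamp();

    /* Record the start time before the launch attempt, not after it. */
    hentry->last_start_time = now;

    (void) logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
                                    MyLogicalRepWorker->dbid,
                                    MySubscription->oid,
                                    MySubscription->name,
                                    MyLogicalRepWorker->userid,
                                    rstate->relid,
                                    DSM_HANDLE_INVALID);

If the launch call itself is slow and then fails, last_start_time will
already be in the past, so the next retry will come sooner than with
the current ordering.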

[1] - https://www.postgresql.org/message-id/ff0663d9-8011-420f-a169-efbf57327cb5%40iki.fi

--
With Regards,
Amit Kapila.
