Logrep launcher race conditions leading to slow tests

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Logrep launcher race conditions leading to slow tests
Date: 2025-06-23 23:56:47
Message-ID: 817604.1750723007@sss.pgh.pa.us
Lists: pgsql-hackers

I've been annoyed for a while because, while a parallel check-world
run usually takes a bit over a minute on my machine, sometimes it
takes between three and four minutes. I was finally able to
track down what is happening, and it's this: sometimes one or
another of the src/test/subscription tests takes an extra three
minutes because the logical replication launcher is sleeping
instead of launching the next task. It eventually reaches its
hard-wired maximum wait of DEFAULT_NAPTIME_PER_CYCLE (3min),
wakes up and notices it has something to do, and then we're
on our merry way again. I'm not sure how often this is a problem
in the real world, but it happens often enough to be annoying
during development.

There are two distinct bugs involved:

1. WaitForReplicationWorkerAttach sometimes has to clear a process
latch event so that it can keep waiting for the worker to launch.
It neglects to set the latch again, allowing ApplyLauncherMain
to miss events (see the first sketch below).

2. ApplyLauncherMain ignores a failure return from
logicalrep_worker_launch, which is bad because (unless it has
another worker launch pending) it will then sleep for
DEFAULT_NAPTIME_PER_CYCLE before reconsidering. What it ought to do
is try again after wal_retrieve_retry_interval (see the second
sketch below). This situation can arise in resource-exhaustion
cases (cf. the early exits in logicalrep_worker_launch), but what's
more likely in the regression tests is that the worker stops due to
some error condition before WaitForReplicationWorkerAttach sees it
attached, which is then duly reported as a failure.
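
To make #1 concrete, here is roughly the shape of fix I have in
mind for WaitForReplicationWorkerAttach (just a sketch with the
unchanged parts elided, not the attached patch verbatim, and the
flag name is made up):

	bool		dropped_latch = false;

	...

		if (rc & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);
			CHECK_FOR_INTERRUPTS();
			/* we may just have eaten an event meant for our caller */
			dropped_latch = true;
		}

	...

	/*
	 * If we had to clear a latch event to keep waiting, re-set it before
	 * returning, so that ApplyLauncherMain's own WaitLatch() still sees
	 * whatever that event was trying to report.
	 */
	if (dropped_latch)
		SetLatch(MyLatch);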
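
And for #2, roughly this (again a sketch; the argument list is from
memory, and wait_time is ApplyLauncherMain's existing nap-time
variable):

			if (!logicalrep_worker_launch(WORKERTYPE_APPLY,
										  sub->dbid, sub->oid, sub->name,
										  sub->owner, InvalidOid,
										  DSM_HANDLE_INVALID))
			{
				/*
				 * The launch failed (no free worker slots, or the worker
				 * died before attaching).  Don't doze for the full
				 * DEFAULT_NAPTIME_PER_CYCLE; retry after
				 * wal_retrieve_retry_interval instead.
				 */
				wait_time = Min(wait_time, wal_retrieve_retry_interval);
			}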

It's possible to make the test slowness extremely reproducible
with this change, which widens the race condition window for
both problems:

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 1c3c051403d..724e82bcdc1 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -214,7 +214,7 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		 */
 		rc = WaitLatch(MyLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
+					   1000L, WAIT_EVENT_BGWORKER_STARTUP);
 
 		if (rc & WL_LATCH_SET)
 		{

I don't recommend that as a permanent change, but it's helpful
for testing the attached patch.

In the attached, I made two other non-cosmetic changes:

3. In WaitForReplicationWorkerAttach, capture worker->in_use
before, not after, releasing LogicalRepWorkerLock (sketched below).
Maybe there is a reason why that's not a dangerous race condition,
but it sure is un-obvious to me.

4. In process_syncing_tables_for_apply (the other caller of
logicalrep_worker_launch), it seems okay to ignore the
result of logicalrep_worker_launch, but I think it should
fill hentry->last_start_time before, not after, the call
(also sketched below). Otherwise we might be changing a
hashtable entry that's no longer relevant to this worker.
I'm not sure exactly where the failed worker will be
cleaned up after, but it could very easily be out of the
system entirely before logicalrep_worker_launch returns.
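
To illustrate #3, the ordering I have in mind is about like this
(a sketch; the early-exit test is approximately what's in the tree
today):

		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

		/* Worker either died or has started. */
		if (!worker->in_use || worker->proc)
		{
			/* capture in_use while we still hold the lock */
			bool		result = worker->in_use;

			LWLockRelease(LogicalRepWorkerLock);
			return result;
		}

		LWLockRelease(LogicalRepWorkerLock);

Once the lock is released, the slot can be recycled for some other
worker, so reading worker->in_use after that point seems dubious
at best.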
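
And for #4, again just a sketch (hentry and now are the existing
locals in process_syncing_tables_for_apply, and the argument list
is from memory):

			/*
			 * Record the start attempt first: by the time
			 * logicalrep_worker_launch returns, a failed worker (and with
			 * it the relevance of this hash entry) may already be gone.
			 */
			hentry->last_start_time = now;

			(void) logicalrep_worker_launch(WORKERTYPE_TABLESYNC,
											MyLogicalRepWorker->dbid,
											MySubscription->oid,
											MySubscription->name,
											MyLogicalRepWorker->userid,
											rstate->relid,
											DSM_HANDLE_INVALID);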

Barring objections, I plan to apply and back-patch this.

regards, tom lane

Attachment: fix-logrep-launcher-race-conditions.patch (text/x-diff, 3.5 KB)
