Deadlock between logrep apply worker and tablesync worker

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-22 19:59:34
Message-ID: 1412708.1674417574@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On my machine, the src/test/subscription/t/002_types.pl test
usually takes right about 1.5 seconds:

$ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
...
[14:22:12] t/002_types.pl .. ok 1550 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.25 csys = 0.95 CPU)
[14:22:13]

I noticed however that sometimes (at least one try in ten, for me)
it takes about 2.5 seconds:

[14:22:16] t/002_types.pl .. ok 2591 ms ( 0.00 usr 0.00 sys + 0.69 cusr 0.28 csys = 0.97 CPU)
[14:22:18]

and I've even seen 3.5 seconds. I dug into this and eventually
identified the cause: it's a deadlock between a subscription's apply
worker and a tablesync worker that it's spawned. Sometimes the apply
worker calls wait_for_relation_state_change (to wait for the tablesync
worker to finish) while it's holding a lock on pg_replication_origin.
If that's the case, then when the tablesync worker reaches
process_syncing_tables_for_sync it is able to perform
UpdateSubscriptionRelState and reach the transaction commit below
that; but when it tries to do replorigin_drop_by_name a little further
down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
So we have an undetected deadlock. We escape that because
wait_for_relation_state_change has a 1-second timeout, after which
it rechecks GetSubscriptionRelState and is able to see the committed
relation state change; so it continues, and eventually releases its
transaction and the lock, permitting the tablesync worker to finish.

I've not tracked down the exact circumstances in which the apply
worker ends up holding a problematic lock, but it seems likely
that it corresponds to cases where its main loop has itself called
replorigin_drop_by_name, a bit further up, for some other concurrent
tablesync operation. (In all the cases I've traced through, the apply
worker is herding multiple tablesync workers when this happens.)

I experimented with having the apply worker release its locks
before waiting for the tablesync worker, as attached. This passes
check-world and it seems to eliminate the test runtime instability,
but I wonder whether it's semantically correct. This whole business
of taking table-wide ExclusiveLock on pg_replication_origin looks
like a horrid kluge that we should try to get rid of, not least
because I don't see any clear documentation of what hazard it's
trying to prevent.

Another thing that has a bad smell about it is the fact that
process_syncing_tables_for_sync uses two transactions in the first
place. There's a comment there claiming that it's for crash safety,
but I can't help suspecting it's really because this case becomes a
hard deadlock without that mid-function commit.

It's not great in any case that the apply worker can move on in
the belief that the tablesync worker is done when in fact the latter
still has catalog state updates to make. And I wonder what we're
doing with having both of them calling replorigin_drop_by_name
... shouldn't that responsibility belong to just one of them?

So I think this whole area deserves a hard look, and I'm not at
all sure that what's attached is a good solution.

regards, tom lane

Attachment Content-Type Size
hack-to-prevent-logrep-worker-deadlocks.patch text/x-diff 1.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-01-22 20:04:14 Re: Remove source code display from \df+?
Previous Message Isaac Morland 2023-01-22 19:53:48 Re: Remove source code display from \df+?