undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Date: 2023-11-21 11:47:38
Message-ID: d291bb50-12c4-e8af-2af2-7bb9bb4d8e3e@enterprisedb.com
Lists: pgsql-hackers

Hi,

I decided to do some stress-testing of the built-in logical replication,
as part of the sequence decoding work. And I soon ran into an undetected
deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(

The attached bash scripts trigger that in a couple of seconds for me. The
main script looks complicated, but most of the code is just waiting for
the sync to complete, catchup, and that sort of thing.

What the script does is pretty simple:

1) initialize two clusters, set them as publisher/subscriber pair

2) create some number of tables, add them to publication and wait for
the sync to complete

3) start two pgbench runs in the background, both modifying the
publication (one removes and re-adds all tables in a single
transaction, the other does that with a separate transaction per table)

4) run refresh.sh, which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
in a loop (now that I think about it, it could be another pgbench
script, but well ...) -- both parts are sketched in SQL after this list

5) some consistency checks, but the lockup happens earlier so this does
not really matter
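
In SQL terms, the pgbench scripts and refresh.sh boil down to something
like this (t1..t3, p and s are placeholder names; the attached scripts
are the authoritative version):

  -- pgbench script A (publisher): drop and re-add all tables in one
  -- transaction
  BEGIN;
  ALTER PUBLICATION p DROP TABLE t1, t2, t3;
  ALTER PUBLICATION p ADD TABLE t1, t2, t3;
  COMMIT;

  -- pgbench script B (publisher): the same, but one transaction per table
  BEGIN;
  ALTER PUBLICATION p DROP TABLE t1;
  ALTER PUBLICATION p ADD TABLE t1;
  COMMIT;
  -- ... and likewise for t2, t3

  -- refresh.sh (subscriber): loops on
  ALTER SUBSCRIPTION s REFRESH PUBLICATION;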

After a small number of refresh cycles (for me it's usually a couple
dozen), we end up with a couple stuck locks (I shortened the backend
type string a bit, for formatting reasons):

test=# select a.pid, classid, objid, backend_type, query
         from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
        where not granted;

   pid   | classid | objid |   backend_type   |        query
---------+---------+-------+------------------+----------------------
 2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
                                                 REFRESH PUBLICATION
 2691837 |    6100 | 16785 | tablesync worker |
 2691936 |    6100 | 16785 | tablesync worker |
(3 rows)

All these backends wait for 6100/16785, which is the subscription's row
in pg_subscription. The tablesync workers are requesting AccessShareLock;
the client backend, however, asks for AccessExclusiveLock.
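
The mode is not shown in the output above, but it comes from the same
join, e.g.:

  select a.pid, l.mode, a.backend_type
    from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
   where not l.granted and l.classid = 6100 and l.objid = 16785;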

The entry is currently locked by:

test=# select a.pid, mode, backend_type from pg_locks l
         join pg_stat_activity a on (a.pid = l.pid)
        where classid=6100 and objid=16785 and granted;

   pid   |      mode       |           backend_type
---------+-----------------+----------------------------------
 2690477 | AccessShareLock | logical replication apply worker
(1 row)

But the apply worker is not waiting for any locks, so what's going on?

Well, the problem is that the apply worker is waiting for a notification
from the tablesync workers that the relation is synced, which happens
through updating the pg_subscription_rel row. That wait happens in
wait_for_relation_state_change, which simply checks the row in a loop,
sleeping on WaitLatch() between checks.
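
So the apply worker never shows up as a lock waiter; pg_stat_activity
should report it on an IPC wait event (LogicalSyncStateChange, IIRC)
instead:

  select pid, wait_event_type, wait_event
    from pg_stat_activity
   where backend_type = 'logical replication apply worker';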

Unfortunately, the tablesync workers can't update the row because the
client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync
workers are stuck in the queue and can't proceed.

The client backend can't proceed, because it's waiting for a lock held
by the apply worker.

The tablesync workers can't proceed because their lock request is stuck
behind the AccessExclusiveLock request.

And the apply worker can't proceed, because it's waiting for status
update from the tablesync workers.

And the deadlock is undetected, because the apply worker is not waiting
on a lock, but sleeping on a latch :-(
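
The lock part of the cycle (and the queue order) is easy to see by
sorting the requests on the pg_subscription row by waitstart (the
granted entry has waitstart NULL); the latch wait is the piece the
deadlock detector never sees:

  select a.pid, l.mode, l.granted, l.waitstart, a.backend_type
    from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
   where l.classid = 6100 and l.objid = 16785
   order by l.waitstart nulls first;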

I don't know what the right solution is here. I wonder if the apply
worker might release the lock before waiting for the update; that would
solve this whole issue.

Alternatively, ALTER SUBSCRIPTION might wait for the lock only for a
limited amount of time and then retry (but then it'd be susceptible to
starvation, of course).
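
A crude client-side version of that idea would be a lock_timeout and a
retry loop around the REFRESH (untested, just to illustrate the bounded
wait + retry):

  set lock_timeout = '1s';
  alter subscription s refresh publication;
  -- on "canceling statement due to lock timeout", simply retry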

Or maybe there's a way to make this work in a way that would be visible
to the deadlock detector? That'd mean we occasionally get processes
killed to resolve a deadlock, but that's still better than processes
stuck indefinitely ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment       Content-Type               Size
lockup-test.sh   application/x-shellscript  9.3 KB
refresh.sh       application/x-shellscript  466 bytes
