logical apply worker's lock waits in subscriber can stall checkpointer in publisher

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: logical apply worker's lock waits in subscriber can stall checkpointer in publisher
Date: 2026-01-27 11:32:57
Message-ID: CAHGQGwFOW_EWtUa-8sTL21KGsWy76CaQZF-FarZqur2RONk3nA@mail.gmail.com
Lists: pgsql-hackers

Hi,

While reviewing the patch at [1], I noticed a case where a lock wait in
a logical apply worker on the subscriber can cause the checkpointer on
the publisher to stall. This seems like unexpected behavior and may
need to be addressed.

The issue can occur as follows:

1. A logical apply worker on the subscriber blocks waiting for a lock.
2. Because the apply worker cannot receive further messages, the walsender's
send buffer on the publisher becomes full.
3. If the walsender then encounters a max_slot_wal_keep_size error,
it attempts to send an error message to the subscriber before exiting.
However, with a full send buffer, the walsender blocks while trying to
send this message.
4. The checkpointer on the publisher calls InvalidateObsoleteReplicationSlots()
and waits for the slot to be released. Since the walsender is stuck and
the slot is not released, the checkpointer also becomes stuck.
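
Steps 2 and 3 can be reproduced in miniature with an ordinary socket pair.
The following is only a toy model, not PostgreSQL code: `sender` stands in
for the walsender, `receiver` for an apply worker that has stopped reading,
and the helper names and the 0.2 second timeout are assumptions made up for
this illustration. It shows that once the kernel send buffer is full, a
blocking send cannot complete until the peer reads.

```python
import socket


def fill_send_buffer(sock: socket.socket) -> int:
    """Queue data until the kernel refuses more; return the bytes queued."""
    sock.setblocking(False)
    queued = 0
    # Large chunks first, then single bytes, so that no room at all is left.
    for chunk in (b"x" * 65536, b"x"):
        try:
            while True:
                queued += sock.send(chunk)
        except BlockingIOError:
            pass
    return queued


def blocking_send_stalls(sock: socket.socket, payload: bytes,
                         timeout: float = 0.2) -> bool:
    """Return True if the payload could not be sent within `timeout` seconds."""
    sock.settimeout(timeout)  # stand-in for "blocks indefinitely"
    try:
        sock.sendall(payload)
        return False
    except socket.timeout:
        return True


# The receiver never calls recv(), like an apply worker stuck on a lock.
sender, receiver = socket.socketpair()
queued = fill_send_buffer(sender)
stalled = blocking_send_stalls(sender, b"final error message")
```

In the real system there is no timeout on that last send, which is why the
walsender in step 3 never gets as far as releasing the slot.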

This behavior seems problematic, doesn't it?

One possible approach to address this issue would be to make the walsender
send the error message in non-blocking mode. Even if the send buffer is full,
the walsender could then exit, allowing the slot to be released and
the checkpointer to proceed. This would mean that, in some cases,
the final error message might not reach the subscriber, which seems
acceptable to me, though others may disagree.
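
As a rough sketch of that idea (again a toy model over a socket pair, not
walsender code; `send_best_effort`, `fill_send_buffer` and the
`slot_released` flag are hypothetical names for illustration), the exiting
process tries the final message once without blocking and releases the slot
whether or not the message could be queued:

```python
import socket


def fill_send_buffer(sock: socket.socket) -> None:
    """Queue data until the kernel refuses even a single byte."""
    sock.setblocking(False)
    for chunk in (b"x" * 65536, b"x"):
        try:
            while True:
                sock.send(chunk)
        except BlockingIOError:
            pass


def send_best_effort(sock: socket.socket, payload: bytes) -> bool:
    """Try once, without blocking; True only if the whole payload was queued."""
    sock.setblocking(False)
    try:
        return sock.send(payload) == len(payload)
    except BlockingIOError:
        return False  # buffer full: drop the message instead of waiting


sender, receiver = socket.socketpair()

# Normal case: the buffer has room, so the message is queued.
delivered_when_idle = send_best_effort(sender, b"final error message")

# Stalled peer: the buffer is full, so the message is dropped.
fill_send_buffer(sender)
delivered_when_full = send_best_effort(sender, b"final error message")

# Either way the exit path continues and the slot can be released.
slot_released = True
```

The trade-off is the one described above: in the second case the peer never
sees the final message, but the sender is no longer held hostage by it.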

This approach would also help when users want to terminate a walsender
via pg_terminate_backend() but the send buffer is full. Today, the walsender
can similarly block in that case while trying to send the error message.

Another idea would be to change the checkpointer so that
InvalidateObsoleteReplicationSlots() operates in a non-blocking manner.
I'm not sure whether that is feasible, but if immediate invalidation is not
strictly required, the checkpointer could give up and retry later.
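
A give-up-and-retry loop of that kind might look like the sketch below.
This is a toy model using a Python lock to represent "slot is in use"; the
`Slot` class, `try_invalidate`, and the 0.05 second wait are all assumptions
for illustration and say nothing about how
InvalidateObsoleteReplicationSlots() is actually structured.

```python
import threading


class Slot:
    """Toy replication slot: `in_use` is held while a walsender owns it."""

    def __init__(self) -> None:
        self.in_use = threading.Lock()
        self.invalidated = False


def try_invalidate(slot: Slot, wait_seconds: float = 0.05) -> bool:
    """Invalidate the slot if it frees up in time; otherwise give up for now."""
    if not slot.in_use.acquire(timeout=wait_seconds):
        return False  # still held: leave it for a later checkpoint
    try:
        slot.invalidated = True
        return True
    finally:
        slot.in_use.release()


slot = Slot()

# A stuck walsender still holds the slot, so this attempt gives up.
slot.in_use.acquire()
first_attempt = try_invalidate(slot)

# Later the walsender has exited and released the slot; the retry succeeds.
slot.in_use.release()
second_attempt = try_invalidate(slot)
```

The open question raised above remains whether deferring invalidation to a
later pass is acceptable, since WAL for that slot is retained in the meantime.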

Thoughts?

Regards,

[1] https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com

--
Fujii Masao
