| From: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
|---|---|
| To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Subject: | Re: Fix LOCK_TIMEOUT handling in slotsync worker |
| Date: | 2025-12-08 04:36:21 |
| Message-ID: | CAJpy0uDzEwSm=1=xnK3o_=W5SfgP2TEVz50-xGk434AKWpE1Og@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Dec 8, 2025 at 7:34 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> Hi,
>
> Previously, the slotsync worker used SIGINT to receive a graceful shutdown
> signal from the startup process on promotion. However, SIGINT is also used by
> the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
> slotsync worker can access and lock catalog tables while parsing libpq tuples,
> this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
> signals and consequently waiting indefinitely on locks.
>
> I can reproduce the issue by:
>
> 1) Create a failover replication slot for slotsync on the primary.
> 2) Start the slotsync worker on the standby and use gdb to pause it
> before it accesses the pg_type catalog via walrcv_exec -> libpqrcv_exec ->
> libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
> 3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
> 4) Log a standby snapshot to replicate the lock to the standby.
> 5) Release the slotsync worker. It starts waiting for the lock on pg_type to
> be released, and on HEAD the wait is never canceled by the lock_timeout
> setting.
>
> Here is a patch to resolve this by replacing the current signal handler with the
> appropriate StatementCancelHandler for SIGINT within the slotsync worker.
> Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
> slotsync of the need to stop during promotion. The slotsync worker now stops
> upon detecting that the shared memory flag (stopSignaled) is set to true.
>
> I did not add a tap-test in the patch for now. Although feasible, it requires
> a strong lock on a catalog and an injection point to control the
> process.
>
Thanks for the patch. I agree with the analysis. I can reproduce the
issue on HEAD and have verified that the patch fixes it.
The patch looks good to me.
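
For readers following the thread, the signal split described in the quoted
message can be sketched as a small standalone program. This is only an
illustration under stated assumptions, not code from the patch: every name
below (cancel_pending, wakeup_pending, stop_signaled, request_worker_stop,
worker_process_interrupts, and so on) is hypothetical, a plain static variable
stands in for the shared memory flag, and raise() stands in for signals sent
by the startup process and the lock timeout handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-ins for worker state; not PostgreSQL's definitions. */
static volatile sig_atomic_t cancel_pending = 0; /* set by SIGINT */
static volatile sig_atomic_t wakeup_pending = 0; /* set by SIGUSR1 */
static volatile sig_atomic_t stop_signaled = 0;  /* models the shared flag */

typedef enum { WORKER_CONTINUE, WORKER_CANCEL_WAIT, WORKER_EXIT } WorkerAction;

/* SIGINT now means only "cancel the current wait or query". */
static void on_sigint(int signo) { (void) signo; cancel_pending = 1; }

/* SIGUSR1 only wakes the worker so that it rechecks the stop flag. */
static void on_sigusr1(int signo) { (void) signo; wakeup_pending = 1; }

static void install(int signo, void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(signo, &sa, NULL);
    assert(rc == 0);
    (void) rc;
}

void worker_install_handlers(void)
{
    install(SIGINT, on_sigint);
    install(SIGUSR1, on_sigusr1);
}

/* Models the startup process at promotion: set the flag, then notify. */
void request_worker_stop(void)
{
    stop_signaled = 1;
    raise(SIGUSR1);
}

/* Models the lock timeout firing while the worker waits on a lock. */
void lock_timeout_fires(void)
{
    raise(SIGINT);
}

/* Called by the worker at safe points to decide what to do next. */
WorkerAction worker_process_interrupts(void)
{
    if (cancel_pending)
    {
        cancel_pending = 0;
        return WORKER_CANCEL_WAIT;  /* give up the lock wait, keep running */
    }
    if (wakeup_pending)
    {
        wakeup_pending = 0;
        if (stop_signaled)
            return WORKER_EXIT;     /* promotion requested a shutdown */
    }
    return WORKER_CONTINUE;
}
```

Because SIGINT no longer doubles as the shutdown request, a lock timeout in
this model cancels the wait without stopping the worker, and the stop request
arrives separately through the flag plus SIGUSR1.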
Thanks,
Shveta