Fix LOCK_TIMEOUT handling in slotsync worker

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Fix LOCK_TIMEOUT handling in slotsync worker
Date: 2025-12-08 02:04:27
Message-ID: TY4PR01MB169078F33846E9568412D878C94A2A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown
signal from the startup process on promotion. However, SIGINT is also used by
the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
slotsync worker can access and lock catalog tables while parsing libpq tuples,
this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
signals and consequently waiting indefinitely on locks.
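
To make the overlap concrete, here is a minimal standalone C sketch. It is not
PostgreSQL code. It only assumes what is described above: the lock timeout
reaches the process as SIGINT, and the worker's SIGINT handler does nothing but
record a shutdown request. All names in it (shutdown_requested,
cancel_requested, handle_sigint_as_shutdown, simulate_lock_timeout,
lock_wait_is_cancelled) are hypothetical and exist only for illustration.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flags; illustration only, not PostgreSQL variables. */
static volatile sig_atomic_t shutdown_requested = 0;
static volatile sig_atomic_t cancel_requested = 0;

/* Old-style arrangement: SIGINT is treated purely as "please shut down". */
static void
handle_sigint_as_shutdown(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Plain ISO C signal() keeps the sketch short. */
static void
install_old_style_handler(void)
{
    signal(SIGINT, handle_sigint_as_shutdown);
}

/* Stand-in for a lock timeout firing: deliver SIGINT to our own process. */
static void
simulate_lock_timeout(void)
{
    raise(SIGINT);
}

/* In this model a lock wait is abandoned only if a cancel was recorded. */
static bool
lock_wait_is_cancelled(void)
{
    return cancel_requested != 0;
}
```

Calling install_old_style_handler() and then simulate_lock_timeout() leaves
shutdown_requested set while lock_wait_is_cancelled() still returns false: the
timeout was delivered, but nothing turned it into a cancel of the lock wait.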

I can reproduce the issue as follows:

1) Create a failover replication slot on the primary for slotsync to synchronize.
2) Start the slotsync worker on the standby, and use gdb to pause it just
before it accesses the pg_type catalog via walrcv_exec -> libpqrcv_exec ->
libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
4) Log a standby snapshot so that the lock is replicated to the standby.
5) Resume the slotsync worker. It starts waiting for the lock on pg_type to
be released, and on HEAD the wait is not canceled when lock_timeout expires.

Here is a patch to resolve this by replacing the current signal handler with the
appropriate StatementCancelHandler for SIGINT within the slotsync worker.
Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
slotsync of the need to stop during promotion. The slotsync worker now stops
upon detecting that the shared memory flag (stopSignaled) is set to true.
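
The separation of the two signals can be sketched in standalone C roughly as
follows. This is an illustration under stated assumptions, not the patch
itself: only StatementCancelHandler and the stopSignaled flag come from the
description above. The struct SlotSyncShared, the flags cancel_pending and
wakeup_pending, and all function names are hypothetical. raise() stands in for
the cross-process signal that the startup process would send, and a static
variable stands in for shared memory.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory state holding stopSignaled. */
typedef struct SlotSyncShared
{
    volatile sig_atomic_t stopSignaled;
} SlotSyncShared;

static SlotSyncShared shared_state;

/* Hypothetical per-process flags set from the signal handlers. */
static volatile sig_atomic_t cancel_pending = 0;
static volatile sig_atomic_t wakeup_pending = 0;

/* SIGINT now means "cancel the current operation", e.g. a timed-out lock wait. */
static void
cancel_on_sigint(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

/* SIGUSR1 only wakes the worker; the reason is read from shared state. */
static void
wake_on_sigusr1(int signo)
{
    (void) signo;
    wakeup_pending = 1;
}

static void
install_worker_handlers(void)
{
    signal(SIGINT, cancel_on_sigint);
    signal(SIGUSR1, wake_on_sigusr1);
}

/* Startup-process side (simulated): set the flag first, then signal. */
static void
request_worker_stop(void)
{
    shared_state.stopSignaled = 1;
    raise(SIGUSR1);
}

/* Worker side: stop only when the shared flag says so. */
static bool
worker_should_stop(void)
{
    return shared_state.stopSignaled != 0;
}
```

In this model a SIGINT sets cancel_pending without making worker_should_stop()
return true, and request_worker_stop() makes worker_should_stop() return true
without touching cancel_pending, so a cancel and a stop request can no longer
be confused with each other.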

I have not added a TAP test to the patch for now. Although one is feasible, it
would require a strong lock on a catalog and an injection point to control the
process.

Best Regards,
Hou zj

Attachment: v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch (application/octet-stream, 2.7 KB)
