| From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Subject: | Fix LOCK_TIMEOUT handling in slotsync worker |
| Date: | 2025-12-08 02:04:27 |
| Message-ID: | TY4PR01MB169078F33846E9568412D878C94A2A@TY4PR01MB16907.jpnprd01.prod.outlook.com |
| Lists: | pgsql-hackers |
Hi,
Previously, the slotsync worker used SIGINT to receive a graceful shutdown
signal from the startup process on promotion. However, SIGINT is also used by
the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
slotsync worker can access and lock catalog tables while parsing libpq tuples,
this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
signals and consequently waiting indefinitely on locks.
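To make the failure mode concrete, here is a minimal single-process sketch in plain C. It is a toy model, not PostgreSQL code: all function and variable names are made up for illustration, and the bounded polling loop only stands in for a lock wait that would otherwise never end. It shows that when the SIGINT handler only records a shutdown request, the SIGINT raised by the lock timeout never sets the cancel flag that the waiter checks.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Flags set from signal handlers; the names are illustrative only. */
static volatile sig_atomic_t shutdown_requested = 0;
static volatile sig_atomic_t query_cancel_pending = 0;

/* Shutdown-style handler: SIGINT is interpreted as "please shut down". */
static void shutdown_on_sigint(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Cancel-style handler: SIGINT is interpreted as "cancel the current wait". */
static void cancel_on_sigint(int signo)
{
    (void) signo;
    query_cancel_pending = 1;
}

/* Stand-in for a lock-timeout handler that signals its own process. */
static void lock_timeout_fired(void)
{
    raise(SIGINT);
}

/*
 * Simulate waiting on a lock that is never granted. The timeout fires once,
 * then the waiter polls the cancel flag. Returns true if the wait was
 * canceled, or false if it would still be waiting after max_polls checks.
 */
static bool wait_for_lock(void (*sigint_handler)(int), int max_polls)
{
    assert(max_polls > 0);

    shutdown_requested = 0;
    query_cancel_pending = 0;
    signal(SIGINT, sigint_handler);

    lock_timeout_fired();

    for (int i = 0; i < max_polls; i++)
    {
        if (query_cancel_pending)
            return true;    /* the timeout was honored */
    }
    return false;           /* the timeout was effectively ignored */
}
```

Calling `wait_for_lock(shutdown_on_sigint, 1000)` models the behavior described above: the signal is delivered, but only the shutdown flag changes, so the wait is not canceled. With `cancel_on_sigint` installed instead, the same timeout ends the wait.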
I can reproduce the issue as follows:
1) Create a failover replication slot for slotsync on the primary.
2) Start the slotsync worker on the standby and use gdb to make it block
before it accesses the pg_type catalog via walrcv_exec -> libpqrcv_exec ->
libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) Take an ACCESS EXCLUSIVE lock on pg_type on the primary.
4) Log a standby snapshot to replicate the lock to the standby.
5) Release the slotsync worker. It will start waiting for the lock on pg_type
to be released, and on HEAD the wait is not canceled by the lock_timeout
setting.
Here is a patch to resolve this by replacing the current signal handler with the
appropriate StatementCancelHandler for SIGINT within the slotsync worker.
Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
slotsync of the need to stop during promotion. The slotsync worker now stops
upon detecting that the shared memory flag (stopSignaled) is set to true.
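The split of responsibilities can be sketched as follows. This is a toy single-process model in plain C, not the patch itself: only the flag name stopSignaled comes from the description above, while the struct, the handler names, and the `toy_*` functions are hypothetical. SIGINT is reserved for cancellation, and the stop request travels through a flag, with SIGUSR1 serving only as a wakeup.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for shared state; only stopSignaled is from the mail. */
typedef struct
{
    volatile sig_atomic_t stopSignaled;
} ToySlotSyncCtx;

static ToySlotSyncCtx toy_ctx;

static volatile sig_atomic_t wakeup_pending = 0;
static volatile sig_atomic_t cancel_pending = 0;

/* SIGUSR1 only wakes the worker; it carries no meaning by itself. */
static void on_sigusr1(int signo)
{
    (void) signo;
    wakeup_pending = 1;
}

/* SIGINT is now free to mean "cancel the current wait". */
static void on_sigint(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

static void toy_install_handlers(void)
{
    toy_ctx.stopSignaled = 0;
    wakeup_pending = 0;
    cancel_pending = 0;
    signal(SIGINT, on_sigint);
    signal(SIGUSR1, on_sigusr1);
}

/* Promotion side: set the flag first, then wake the worker. */
static void toy_request_stop(void)
{
    toy_ctx.stopSignaled = 1;
    int rc = raise(SIGUSR1);
    assert(rc == 0);
    (void) rc;
}

typedef enum { TOY_CONTINUE, TOY_CANCELED, TOY_STOP } ToyAction;

/* Worker side: decide what to do at a safe point in the main loop. */
static ToyAction toy_process_interrupts(void)
{
    if (toy_ctx.stopSignaled)
        return TOY_STOP;
    if (cancel_pending)
    {
        cancel_pending = 0;
        return TOY_CANCELED;
    }
    return TOY_CONTINUE;
}
```

In this model, a SIGINT results in `TOY_CANCELED` for the current operation only, whereas `toy_request_stop()` makes every later check return `TOY_STOP`. Setting the flag before sending the signal means the worker cannot wake up and miss the request.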
I did not add a TAP test in the patch for now. Although feasible, it would
require taking a strong lock on a catalog and using an injection point to
control the process.
Best Regards,
Hou zj
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch | application/octet-stream | 2.7 KB |