| From: | Sami Imseih <samimseih(at)gmail(dot)com> |
|---|---|
| To: | Baji Shaik <baji(dot)pgdev(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, alvherre(at)kurilemu(dot)de |
| Subject: | Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit |
| Date: | 2026-05-13 03:45:07 |
| Message-ID: | CAA5RZ0sYXTGQK=JStyFv9p12sZk3Vc0_9E10HbgJa47CCoGfQQ@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Thanks for reporting. This indeed looks like a bug.
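With pg_terminate_backend, the logical replication worker has no
way to know that it needs to stop, because the PG_FINALLY block is not
reached in this case: a FATAL error goes straight to proc_exit()
instead of unwinding through the PG_TRY stack.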
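To illustrate why the cleanup is skipped, here is a rough standalone model in C. It is not PostgreSQL code: every `toy_` name is made up, and the setjmp/longjmp handling is only a simplified stand-in for PG_TRY/PG_FINALLY (it does not re-throw after the cleanup block). It assumes only that ERROR unwinds to the enclosing block while FATAL runs exit callbacks and leaves.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_OK, TOY_ERROR, TOY_FATAL } toy_outcome;

static jmp_buf *toy_exception_stack = NULL; /* stand-in for the PG_TRY stack */
static jmp_buf  toy_process_end;            /* where we land once the "process" is gone */
static bool     worker_stopped;             /* set by whichever cleanup path runs */
static bool     exit_callback_registered;   /* hypothetical exit-time callback */

static void toy_stop_worker(void) { worker_stopped = true; }

/* Stand-in for proc_exit(): no unwinding, only registered callbacks run. */
static void toy_proc_exit(void)
{
    if (exit_callback_registered)
        toy_stop_worker();
    longjmp(toy_process_end, 1);
}

static void toy_report(toy_outcome level)
{
    if (level == TOY_ERROR)
    {
        assert(toy_exception_stack != NULL);
        longjmp(*toy_exception_stack, 1);   /* unwind to the enclosing "try" */
    }
    if (level == TOY_FATAL)
        toy_proc_exit();                    /* never returns to the "try" */
}

static void toy_repack(toy_outcome level)
{
    jmp_buf  local;
    jmp_buf *saved = toy_exception_stack;

    if (setjmp(local) == 0)
    {
        toy_exception_stack = &local;
        toy_report(level);                  /* body of the "try" block */
    }
    /* "finally": reached on normal completion and on ERROR, not on FATAL */
    toy_exception_stack = saved;
    toy_stop_worker();
}

/* Returns true if the worker was stopped by the time the command ended. */
bool toy_run(toy_outcome level, bool register_callback)
{
    worker_stopped = false;
    exit_callback_registered = register_callback;
    if (setjmp(toy_process_end) == 0)
        toy_repack(level);
    return worker_stopped;
}
```

In this model, `toy_run(TOY_FATAL, false)` leaves the worker running, while `toy_run(TOY_FATAL, true)` stops it, which is the reason to register an exit-time callback.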
I think registering a callback to terminate the worker is the proper fix,
but I don't think on_proc_exit() is the right place to register the
callback.
With 0001 applied on an assert-enabled build, I see a segfault:
```
postgres=# select pg_terminate_backend(26707);
 pg_terminate_backend
----------------------
 t
(1 row)

postgres=# select 1;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
postgres=?#
```
```
2026-05-12 21:50:33.866 CDT [26569] LOG: client backend (PID 26707)
was terminated by signal 11: Segmentation fault: 11
2026-05-12 21:50:33.866 CDT [26569] LOG: terminating any other active
server processes
2026-05-12 21:50:33.872 CDT [26569] LOG: all server processes
terminated; reinitializing
2026-05-12 21:50:33.882 CDT [27131] LOG: database system was
interrupted; last known up at 2026-05-12 21:45:39 CDT
2026-05-12 21:50:34.278 CDT [27131] LOG: database system was not
properly shut down; automatic recovery in progress
2026-05-12 21:50:34.281 CDT [27131] LOG: redo starts at 13/619E9470
```
From lldb on my Mac, I see:
```
Process 22683 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason =
EXC_BAD_ACCESS (code=1, address=0x7f7f7f7f7f7f7f7f)
frame #0: 0x00000001044c607c
postgres`TerminateBackgroundWorker(handle=0x7f7f7f7f7f7f7f7f) at
bgworker.c:1324:2 [opt]
1321 BackgroundWorkerSlot *slot;
1322 bool signal_postmaster = false;
1323
-> 1324 Assert(handle->slot < max_worker_processes);
1325 slot = &BackgroundWorkerData->slot[handle->slot];
1326
1327 /* Set terminate flag in shared memory, unless
slot has been reused. */
```
The handle value 0x7f7f7f7f7f7f7f7f is the CLOBBER_FREED_MEMORY fill
pattern from wipe_mem(), i.e. the pointer was read from memory that had
already been freed. The handle's memory context has already been
destroyed by the time on_proc_exit callbacks run.
A better fix is to use before_shmem_exit instead, which is for
user-level cleanup.
```
/* ----------------------------------------------------------------
 * before_shmem_exit
 *
 * Register early callback to perform user-level cleanup,
```
If we do that, we can also wait for the worker to shut down, so we can use
stop_repack_decoding_worker().
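As a rough illustration of the ordering argument, here is a small standalone model in C. It is not PostgreSQL code: the `sim_` names are invented, and it assumes only what is described above, namely that before_shmem_exit callbacks run while backend-local memory is still intact and that this memory has been wiped with 0x7F before on_proc_exit callbacks run.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { int slot; } sim_handle;            /* stand-in for the worker handle */
typedef struct { sim_handle *handle; } sim_state;   /* backend-local state holding it */

typedef enum { SIM_BEFORE_SHMEM_EXIT, SIM_ON_PROC_EXIT } sim_phase;

static sim_handle worker_handle = { 3 };
static sim_state  repack_state;     /* models memory owned by a memory context */
static bool       handle_was_valid;

/* Models wipe_mem() under CLOBBER_FREED_MEMORY: freed memory is filled with 0x7F. */
static void sim_destroy_context(void)
{
    memset(&repack_state, 0x7F, sizeof(repack_state));
}

/*
 * The cleanup callback. A real callback would dereference the pointer and
 * crash if it were clobbered; this toy only inspects the pointer value.
 */
static void sim_stop_worker(void)
{
    handle_was_valid = (repack_state.handle == &worker_handle);
}

/* Runs one simulated backend exit with the callback registered in `phase`. */
bool sim_exit_sees_valid_handle(sim_phase phase)
{
    repack_state.handle = &worker_handle;
    handle_was_valid = false;
    assert(repack_state.handle != NULL);

    /* Phase 1: before_shmem_exit callbacks (user-level cleanup). */
    if (phase == SIM_BEFORE_SHMEM_EXIT)
        sim_stop_worker();

    /* Assumed to happen in between: backend-local memory is released. */
    sim_destroy_context();

    /* Phase 2: on_proc_exit callbacks. */
    if (phase == SIM_ON_PROC_EXIT)
        sim_stop_worker();

    return handle_was_valid;
}
```

With `SIM_BEFORE_SHMEM_EXIT` the callback still sees the original pointer. With `SIM_ON_PROC_EXIT` it sees the 0x7F pattern, which corresponds to the bad handle in the lldb trace.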
What do you think?
--
Sami Imseih
Amazon Web Services (AWS)
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Fix-REPACK-decoding-worker-not-cleaned-up-on-FATA.patch | application/octet-stream | 2.6 KB |