| From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> | 
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org | 
| Cc: | Thomas Munro <tmunro(at)postgresql(dot)org> | 
| Subject: | processes stuck in shutdown following OOM/recovery | 
| Date: | 2023-12-01 05:13:25 | 
| Message-ID: | ZWlrdQarrZvLsgIk@pryzbyj2023 | 
| Lists: | pgsql-hackers | 
If postgres starts, one of its children is immediately killed, and the
cluster is then told to stop, the server does not shut down; instead,
the whole system gets wedged.
$ initdb -D ./pgdev.dat1
$ pg_ctl -D ./pgdev.dat1 start -o '-c port=5678'
$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
.......................................................... failed
pg_ctl: server does not shut down
$ ps -wwwf --ppid 2524494
UID          PID    PPID  C STIME TTY          TIME CMD
pryzbyj  2524552 2524494  0 20:47 ?        00:00:00 postgres: checkpointer 
(gdb) bt
#0  0x00007f0ce2d08c03 in epoll_wait (epfd=10, events=0x55cb4cbaac28, maxevents=1, timeout=timeout@entry=156481) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055cb4c219208 in WaitEventSetWaitBlock (set=set@entry=0x55cb4cbaabc0, cur_timeout=cur_timeout@entry=156481, occurred_events=occurred_events@entry=0x7ffd80130410,
    nevents=nevents@entry=1) at ../src/backend/storage/ipc/latch.c:1583
#2  0x000055cb4c219e02 in WaitEventSetWait (set=0x55cb4cbaabc0, timeout=timeout@entry=300000, occurred_events=occurred_events@entry=0x7ffd80130410, nevents=nevents@entry=1,
    wait_event_info=wait_event_info@entry=83886084) at ../src/backend/storage/ipc/latch.c:1529
#3  0x000055cb4c219f87 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=300000, wait_event_info=wait_event_info@entry=83886084)
    at ../src/backend/storage/ipc/latch.c:539
#4  0x000055cb4c1aabc2 in CheckpointerMain () at ../src/backend/postmaster/checkpointer.c:523
#5  0x000055cb4c1a8207 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at ../src/backend/postmaster/auxprocess.c:153
#6  0x000055cb4c1ae63d in StartChildProcess (type=type@entry=CheckpointerProcess) at ../src/backend/postmaster/postmaster.c:5331
#7  0x000055cb4c1b07f3 in ServerLoop () at ../src/backend/postmaster/postmaster.c:1792
#8  0x000055cb4c1b1c56 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x55cb4cbaa380) at ../src/backend/postmaster/postmaster.c:1466
#9  0x000055cb4c0f4c1b in main (argc=5, argv=0x55cb4cbaa380) at ../src/backend/main/main.c:198
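For anyone who wants to retry this without copying a PID by hand, the steps above could be scripted roughly as follows. This is only a sketch: the function names (postmaster_pid, first_child_pid, reproduce) and the PGDATA/PGPORT defaults are made up for illustration, it assumes an already-initialized data directory, and it kills whichever child "ps --ppid" lists first rather than a specific process.

```shell
#!/bin/sh
# Hypothetical helper script; names and defaults are illustrative only.
PGDATA=${PGDATA:-./pgdev.dat1}
PGPORT=${PGPORT:-5678}

# Print the postmaster PID, which is the first line of postmaster.pid.
postmaster_pid() {
    head -n 1 "$1/postmaster.pid"
}

# Print the PID of the first child that ps lists for the given parent.
first_child_pid() {
    ps -o pid= --ppid "$1" | head -n 1 | tr -d ' '
}

# Start the cluster, SIGKILL one child, then request a fast shutdown.
reproduce() {
    pg_ctl -D "$PGDATA" start -o "-c port=$PGPORT" || return 1
    child=$(first_child_pid "$(postmaster_pid "$PGDATA")")
    [ -n "$child" ] || return 1
    kill -9 "$child"
    sleep 0.05    # same delay as in the manual steps above
    pg_ctl -D "$PGDATA" stop -m fast
}

# Only touch a cluster when explicitly asked to.
if [ "${1:-}" = "run" ]; then
    reproduce
fi
```

Invoked with "run" as its only argument, it goes through the same start, kill, stop sequence; the 0.05 second delay is the one I used by hand and may need adjusting on other machines.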
I noticed this because of the counterproductive behavior of the systemd+PGDG
unit files, which run "pg_ctl stop" whenever a backend is killed for OOM:
https://www.postgresql.org/message-id/ZVI112aVNCHOQgfF@pryzbyj2023
This affects v15; it fails at 7ff23c6d27 but not at its parent.
commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 (HEAD)
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date:   Mon Aug 2 17:32:20 2021 +1200
Run checkpointer and bgwriter in crash recovery.
-- 
Justin