processes stuck in shutdown following OOM/recovery

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Thomas Munro <tmunro(at)postgresql(dot)org>
Subject: processes stuck in shutdown following OOM/recovery
Date: 2023-12-01 05:13:25
Message-ID: ZWlrdQarrZvLsgIk@pryzbyj2023
Lists: pgsql-hackers

If postgres starts, one of its children is immediately killed, and the
cluster is then told to stop, the shutdown never completes and the whole
system gets wedged.

$ initdb -D ./pgdev.dat1
$ pg_ctl -D ./pgdev.dat1 start -o '-c port=5678'
$ kill -9 2524495; sleep 0.05; pg_ctl -D ./pgdev.dat1 stop -m fast # 2524495 is a child's pid
.......................................................... failed
pg_ctl: server does not shut down
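
The steps above can be scripted so that the child's pid is looked up rather than typed by hand. This is only a sketch: the helper names (postmaster_pid, first_child_pid, reproduce) are hypothetical, it assumes pgrep is available, and it kills whichever child pgrep happens to list first, which may not be the same child as in the run above.

```shell
#!/bin/sh
# Sketch of the reproduction with the child pid looked up instead of hard-coded.
# All function names here are hypothetical helpers, not part of PostgreSQL.

# Print the postmaster pid, which is the first line of <datadir>/postmaster.pid.
postmaster_pid() {
    head -n 1 "$1/postmaster.pid"
}

# Print the pid of one child of process $1 (whichever pgrep lists first).
first_child_pid() {
    pgrep -P "$1" | head -n 1
}

# Start the cluster, SIGKILL one child, then immediately request a fast stop.
# $1 is the data directory, $2 is the port.
reproduce() {
    datadir=$1
    port=$2
    pg_ctl -D "$datadir" start -o "-c port=$port" || return 1
    child=$(first_child_pid "$(postmaster_pid "$datadir")")
    [ -n "$child" ] || return 1
    kill -9 "$child"
    sleep 0.05
    pg_ctl -D "$datadir" stop -m fast
}
```

After initdb, "reproduce ./pgdev.dat1 5678" runs the same sequence as the commands above; the result may depend on which child is killed and on timing.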

$ ps -wwwf --ppid 2524494
UID PID PPID C STIME TTY TIME CMD
pryzbyj 2524552 2524494 0 20:47 ? 00:00:00 postgres: checkpointer

(gdb) bt
#0 0x00007f0ce2d08c03 in epoll_wait (epfd=10, events=0x55cb4cbaac28, maxevents=1, timeout=timeout@entry=156481) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000055cb4c219208 in WaitEventSetWaitBlock (set=set@entry=0x55cb4cbaabc0, cur_timeout=cur_timeout@entry=156481, occurred_events=occurred_events@entry=0x7ffd80130410,
    nevents=nevents@entry=1) at ../src/backend/storage/ipc/latch.c:1583
#2 0x000055cb4c219e02 in WaitEventSetWait (set=0x55cb4cbaabc0, timeout=timeout@entry=300000, occurred_events=occurred_events@entry=0x7ffd80130410, nevents=nevents@entry=1,
    wait_event_info=wait_event_info@entry=83886084) at ../src/backend/storage/ipc/latch.c:1529
#3 0x000055cb4c219f87 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=300000, wait_event_info=wait_event_info@entry=83886084)
    at ../src/backend/storage/ipc/latch.c:539
#4 0x000055cb4c1aabc2 in CheckpointerMain () at ../src/backend/postmaster/checkpointer.c:523
#5 0x000055cb4c1a8207 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at ../src/backend/postmaster/auxprocess.c:153
#6 0x000055cb4c1ae63d in StartChildProcess (type=type@entry=CheckpointerProcess) at ../src/backend/postmaster/postmaster.c:5331
#7 0x000055cb4c1b07f3 in ServerLoop () at ../src/backend/postmaster/postmaster.c:1792
#8 0x000055cb4c1b1c56 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x55cb4cbaa380) at ../src/backend/postmaster/postmaster.c:1466
#9 0x000055cb4c0f4c1b in main (argc=5, argv=0x55cb4cbaa380) at ../src/backend/main/main.c:198

I noticed this because of the counterproductive behavior of the systemd+PGDG
unit files, which run "pg_ctl stop" whenever a backend is killed for OOM:
https://www.postgresql.org/message-id/ZVI112aVNCHOQgfF@pryzbyj2023

This affects v15. It fails at 7ff23c6d27 but not at its parent.

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 (HEAD)
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date: Mon Aug 2 17:32:20 2021 +1200

    Run checkpointer and bgwriter in crash recovery.

--
Justin
