Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU

From: Klaus Darilion <klaus(dot)darilion(at)nic(dot)at>
To: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU
Date: 2025-07-21 10:47:40
Message-ID: DBAPR03MB6358854AD71C8ABA5CA10A8DF15DA@DBAPR03MB6358.eurprd03.prod.outlook.com
Lists: pgsql-general

(Note: I have also attached the whole email for better readability of the logs)

Hello!

Our setup: 5 Node Patroni Cluster with PostgreSQL 16.9.
db1: current leader
db2: sync-replica
db3/4/5: replica

The replicas connect to the leader using the leader's host IP. So there are 4 walsenders for Patroni: 1 sync and 3 async.

The patroni cluster utilizes a service IP-address (VIP). The VIP is used by all clients connecting to the current leader. These clients are:
- some web-apps doing normal DB queries (read/write)
- 2 barman backup clients using streaming replication
- 58 logical replication clients

Additionally we use https://github.com/EnterpriseDB/pg_failover_slots to sync and advance the logical replication slots on the replicas. The failover_slots plugin periodically connects to the leader using the VIP.
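
For reference, the slot and walsender layout described above can be checked on the leader with something like this (the connection details and the "db1" hostname are placeholders for our setup):

# Sketch: inspect replication slots and walsenders on the current leader.
PSQL="psql -X -h db1 -U postgres -d postgres"

# Slots: physical ones for the Patroni replicas and barman, logical ones
# for the 58 logical replication clients.
$PSQL -c "SELECT slot_name, slot_type, active FROM pg_replication_slots ORDER BY slot_type, slot_name;"

# The walsenders serving them, including the address they connect from
# (host IP for the Patroni replicas, VIP for everything else).
$PSQL -c "SELECT pid, application_name, client_addr, state, sync_state FROM pg_stat_replication ORDER BY client_addr;"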

We had a planned maintenance window and wanted to switch the leader from db1 to db2:
12:32:18: patronictl switchover --leader db1 --candidate db2

So the postmaster received the fast shutdown request from Patroni and started shutting down the client connection processes:
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 2748 LOG: received fast shutdown request
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 2748 LOG: aborting any active transactions
12:32:42 db1 patroni[1842780]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 1842780 FATAL: terminating connection due to administrator command
12:32:42 db1 patroni[1842702]: patroni_postgres: 2025-07-16 12:32:42.224 UTC 1842702 FATAL: terminating connection due to administrator command
...
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.555 UTC 2748 LOG: background worker "logical replication launcher" (PID 99854) exited with exit code 1
12:32:42 db1 patroni[3935]: patroni_postgres: 2025-07-16 12:32:42.573 UTC 3935 FATAL: terminating background worker "pg_failover_slots worker" due to administrator command
12:32:42 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:32:42.626 UTC 2748 LOG: background worker "pg_failover_slots worker" (PID 3935) exited with exit code 1

At that time, all the normal clients were disconnected, and in the logs we only see new connection attempts that were refused, e.g.:
12:32:42 db1 patroni[1846807]: patroni_postgres: 2025-07-16 12:32:42.796 UTC 1846807 LOG: connection received: host=xxxx:xxx:8::13 port=53704
12:32:42 db1 patroni[1846807]: patroni_postgres: 2025-07-16 12:32:42.800 UTC 1846807 FATAL: the database system is shutting down

12:32:43 db1 patroni[2988]: patroni_postgres: 2025-07-16 12:32:43.259 UTC 2988 LOG: shutting down

We may have missed some real Patroni or PostgreSQL logs, as journald suppressed messages :-(
12:33:39 db1 systemd-journald[1613]: Suppressed 10375 messages from patroni.service

Usually the switchover only takes a few seconds. After waiting a few minutes we became anxious and started debugging.

Using "ps -Alf|grep postgres" we saw that there were no more normal client connections, but still 58 logical replicaton walsender processes and 6 streaming replication walsenders.
"top" revealed that the walsenders were eating CPU. I know for sure that I saw plenty logical replication walsenders that were eating 100% CPU, but I have not checked if also the streaming replication walsenders were eating CPU.
I also straced the checkpointer process, and that was doing no I/O, just looping or polling (I can't remember the details). Our Performance Monitoring showed a general drop of I/O activity during the shutdown.
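
Roughly what we looked at on the OS level while the shutdown was hanging (reconstructed from memory, the exact commands may have differed; SQL-level inspection was not possible since new connections were refused):

# Remaining backends - at this point only walsenders and auxiliary processes were left.
ps -Alf | grep postgres

# One batch-mode snapshot of top, sorted by CPU, with full command lines,
# to see which walsenders are spinning.
top -b -c -n 1 -o %CPU | grep walsender

# Sample the checkpointer's syscalls for a few seconds (it was doing no I/O).
timeout 5 strace -c -p "$(pgrep -f checkpointer)"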

So this sounds like the walsenders were stuck in a state that is called a "busy loop" on the postgres mailing lists.

Then we decided to remove the VIP (which is used by the logical replication clients) from db1 to trigger replication errors:
systemctl stop vip-manager.service
12:50:11 db1 commandlog[2276907]: root/root : ip a d x.x.x.x/24 dev bond0.203
12:50:11 db1 commandlog[2276907]: root/root : ip -6 a d xxxx:xxx:x:x::10/64 dev bond0.203

Then, either because of that or for some other reason, the connections timed out. We are using the default timeout:
wal_sender_timeout: 60s

To be frank, the timeouts did not start 60 s after removing the VIP, but already after 45 seconds. But I guess the timeout had been armed a bit earlier, as there was no more traffic on the walsender connections except the status keepalives from the clients.
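
For completeness, this is roughly how the timeout and the last status reply per walsender can be checked while the server still accepts connections ("db1" again being a placeholder):

# Sender-side timeout and the last status reply received from each client
# (pg_stat_replication.reply_time).
psql -X -h db1 -U postgres \
     -c "SHOW wal_sender_timeout;" \
     -c "SELECT application_name, state, reply_time FROM pg_stat_replication;"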

12:50:55 db1 patroni[2586700]: patroni_postgres: 2025-07-16 12:50:55.977 UTC 2586700 LOG: terminating walsender process due to replication timeout
12:50:55 db1 patroni[2586700]: patroni_postgres: 2025-07-16 12:50:55.977 UTC 2586700 STATEMENT: START_REPLICATION SLOT "barman_sbg" 7DFD/5F000000 TIMELINE 22
12:50:56 db1 patroni[2579309]: patroni_postgres: 2025-07-16 12:50:56.027 UTC 2579309 LOG: terminating walsender process due to replication timeout
12:50:56 db1 patroni[2579309]: patroni_postgres: 2025-07-16 12:50:56.027 UTC 2579309 STATEMENT: START_REPLICATION SLOT "reg_tam1" LOGICAL 75D4/F7FDBB8 (proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[827455]: patroni_postgres: 2025-07-16 12:50:56.212 UTC 827455 LOG: terminating walsender process due to replication timeout
12:50:56 db1 patroni[827455]: patroni_postgres: 2025-07-16 12:50:56.212 UTC 827455 STATEMENT: START_REPLICATION SLOT "reg_lis1" LOGICAL 7C86/593393A0 (proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[3390803]: patroni_postgres: 2025-07-16 12:50:56.506 UTC 3390803 LOG: terminating walsender process due to replication timeout
12:50:56 db1 patroni[3390803]: patroni_postgres: 2025-07-16 12:50:56.506 UTC 3390803 STATEMENT: START_REPLICATION SLOT "reg_rey1" LOGICAL 77A5/205F5F0 (proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[137758]: patroni_postgres: 2025-07-16 12:50:56.672 UTC 137758 LOG: terminating walsender process due to replication timeout
12:50:56 db1 patroni[137758]: patroni_postgres: 2025-07-16 12:50:56.672 UTC 137758 STATEMENT: START_REPLICATION SLOT "reg_tlv1" LOGICAL 7DC2/CEAB27A0 (proto_version '4', origin 'any', publication_names '"mypublications"')
12:50:56 db1 patroni[2586624]: patroni_postgres: 2025-07-16 12:50:56.929 UTC 2586624 LOG: terminating walsender process due to replication timeout
12:50:56 db1 patroni[2586624]: patroni_postgres: 2025-07-16 12:50:56.929 UTC 2586624 STATEMENT: START_REPLICATION SLOT "barman" 7DFD/60000000 TIMELINE 22

...and another 54 logical walsenders that timed out and were terminated

12:51:05 db1 patroni[2988]: patroni_postgres: 2025-07-16 12:51:05.902 UTC 2988 LOG: checkpoint starting: shutdown immediate
12:51:05 db1 patroni[2988]: patroni_postgres: 2025-07-16 12:51:05.926 UTC 2988 LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.021 s, sync=0.001 s, total=0.026 s; sync files=9, longest=0.001 s, average=0.001 s; distance=55 kB, estimate=1080354 kB; lsn=7E53/18554B28, redo lsn=7E53/18554B28
12:51:07 db1 patroni[2748]: patroni_postgres: 2025-07-16 12:51:07.159 UTC 2748 LOG: database system is shut down
12:51:07 db1 patroni[2389]: 2025-07-16 12:51:07,954 INFO: Got response from db2 https://db2:8008/patroni: {"state": "running", "postmaster_start_time": "2025-07-16 12:17:25.758501+00:00", "role": "replica", "server_version": 160009, "xlog": {"received_location": 138895355628448, "replayed_location": 138895355628448, "replayed_timestamp": "2025-07-16 12:32:42.142954+00:00", "paused": false}, "sync_standby": true, "timeline": 22, "dcs_last_seen": 1752670265, "database_system_identifier": "7386557262250076198", "patroni": {"version": "4.0.5", "scope": "regdns_patroni_cluster_1", "name": "db2"}}

After all walsenders that used the VIP (58 logical senders and the 2 streaming senders to barman) had been closed due to the timeout, the database was shut down and the switchover happened.

I do not know for sure, but I think the walsenders to the Patroni replicas, which use the host IP and not the VIP, were still active when we removed the VIP. But it seems they shut down cleanly after all the other walsenders had terminated due to the timeout.

So what caused the hang? We found the bug report https://www.postgresql.org/message-id/aD0Xvh4Tj_Vk7pCA%40paquier.xyz but that one seems to only affect logical walsenders on a standby.

We cannot reproduce it: after a few minutes we made another switchover back to db1, and this time it completed within a few seconds.

Do you have any ideas what the problem could be, how we could solve it, and how to debug it if it happens again?
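
In case it happens again, this is roughly what we would try to capture ourselves while the walsenders are spinning (just a sketch on our side, assuming gdb and the PostgreSQL debug symbols are installed; better suggestions welcome):

# Backtraces of the spinning walsenders, to see where the loop sits.
for pid in $(pgrep -f 'postgres: .*walsender'); do
    echo "=== walsender pid $pid ==="
    gdb -batch -p "$pid" -ex 'bt' 2>/dev/null
done

# A few seconds of syscall counting per walsender, to confirm the busy loop.
for pid in $(pgrep -f 'postgres: .*walsender'); do
    timeout 3 strace -c -p "$pid"
done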

Thanks
Klaus

--
Klaus Darilion, Head of Operations
nic.at GmbH, Jakob-Haringer-Straße 8/V
5020 Salzburg, Austria

Attachment: postgresql-shutdown-hangs.txt (text/plain, 8.5 KB)
