Re: Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Klaus Darilion <klaus(dot)darilion(at)nic(dot)at>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU
Date: 2025-07-21 13:35:11
Message-ID: 7b0ac868e97508aac3661a6a665fea00be2b923e.camel@cybertec.at
Lists: pgsql-general

On Mon, 2025-07-21 at 10:47 +0000, Klaus Darilion wrote:
> (Note: I have also attached the whole email for better readability of the logs)

Your mail looks good enough the way it is:
https://postgr.es/m/DBAPR03MB6358854AD71C8ABA5CA10A8DF15DA%40DBAPR03MB6358.eurprd03.prod.outlook.com

> Our setup: 5 Node Patroni Cluster with PostgreSQL 16.9.
> db1: current leader
> db2: sync-replica
> db3/4/5: replica
>  
> The replicas connect to the leader using the host IP of the leader. So there are
> 4 walsenders for Patroni: 1 sync and 3 async.
>  
> The patroni cluster utilizes a service IP-address (VIP). The VIP is used by all
> clients connecting to the current leader. These clients are:
> - some web-apps doing normal DB queries (read/write)
> - 2 barman backup clients using streaming replication
> - 58 logical replication clients
>  
> Additionally we use https://github.com/EnterpriseDB/pg_failover_slots to sync and
> advance the logical replication slots on the replicas. The failover_slots plugin
> periodically connects to the leader using the VIP.
>  
> We had a planned maintenance and wanted to switch the leader from db1 to db2:
> 12:32:18: patronictl switchover --leader db1 --candidate db2
>  
> So postmaster received the fast shutdown request from Patroni and started
> shutting down the client connection processes:
>  
> Usually the switchover only takes a few seconds. After waiting a few minutes
> we became anxious and started debugging.
>  
> Using "ps -Alf|grep postgres" we saw that there were no more normal client
> connections, but still 58 logical replication walsender processes and
> 6 streaming replication walsenders.
> "top" revealed that the walsenders were eating CPU.

We have had a somewhat similar report:
https://www.postgresql.org/message-id/flat/18985-64431d78bcabae95%40postgresql.org

What is the logical decoding plugin you are using?

If it is "pgoutput", what are the walsenders doing? You can try "strace" and
use "gdb" to break into the walsenders and take a stack trace.
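
A minimal sketch of how such traces could be collected. It assumes strace, gdb and timeout(1) are installed on the leader, that the commands run as root or as the PostgreSQL OS user, and that the process titles contain the word "walsender" as in the "ps" output described above. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of
# PostgreSQL or Patroni.

# Read "PID ARGS" lines (as printed by `ps -eo pid,args`) on stdin and
# print the PIDs of processes whose title marks them as walsenders.
walsender_pids() {
    awk '$2 == "postgres:" && / walsender / { print $1 }'
}

# For each walsender: sample its system calls for a few seconds, then
# print a stack trace. Assumes strace, gdb and timeout(1) are installed.
dump_walsender_stacks() {
    for pid in $(ps -eo pid,args | walsender_pids); do
        echo "=== walsender $pid ==="
        timeout 5 strace -tt -p "$pid" 2>&1 | tail -n 20
        gdb --batch -p "$pid" -ex bt 2>&1
    done
}

# Usage, while the shutdown is hanging:
#   dump_walsender_stacks > /tmp/walsender-stacks.txt
```

If strace shows no system calls while the process still uses 100% CPU, the walsender is probably looping in user space, and the gdb backtrace is the more useful of the two outputs. The backtrace is only readable if debug symbols for PostgreSQL are installed.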

Yours,
Laurenz Albe
