From: | Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> |
---|---|
To: | Klaus Darilion <klaus(dot)darilion(at)nic(dot)at>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU |
Date: | 2025-07-21 13:35:11 |
Message-ID: | 7b0ac868e97508aac3661a6a665fea00be2b923e.camel@cybertec.at |
Lists: | pgsql-general |
On Mon, 2025-07-21 at 10:47 +0000, Klaus Darilion wrote:
> (Note: I have also attached the whole email for better readability of the logs)
Your mail looks good enough the way it is:
https://postgr.es/m/DBAPR03MB6358854AD71C8ABA5CA10A8DF15DA%40DBAPR03MB6358.eurprd03.prod.outlook.com
> Our setup: 5 Node Patroni Cluster with PostgreSQL 16.9.
> db1: current leader
> db2: sync-replica
> db3/4/5: replica
>
> The replicas connect to the leader using the host IP of the leader. So there are
> 4 walsenders for Patroni: 1 sync and 3 async.
>
> The patroni cluster utilizes a service IP-address (VIP). The VIP is used by all
> clients connecting to the current leader. These clients are:
> - some web-apps doing normal DB queries (read/write)
> - 2 barman backup clients using streaming replication
> - 58 logical replication clients
>
> Additionally we use https://github.com/EnterpriseDB/pg_failover_slots to sync and
> advance the logical replication slots on the replicas. The failover_slots plugin
> periodically connects to the leader using the VIP.
>
> We had a planned maintenance and wanted to switch the leader from db1 to db2:
> 12:32:18: patronictl switchover --leader db1 --candidate db2
>
> So postmaster received the fast shutdown request from Patroni and started
> shutting down the client connection processes:
>
> Usually the switchover only takes a few seconds. After waiting a few minutes
> we became anxious and started debugging.
>
> Using "ps -Alf|grep postgres" we saw that there were no more normal client
> connections, but still 58 logical replication walsender processes and
> 6 streaming replication walsenders.
> "top" revealed that the walsenders were eating CPU.
We have had a somewhat similar report:
https://www.postgresql.org/message-id/flat/18985-64431d78bcabae95%40postgresql.org
What is the logical decoding plugin you are using?
If it is "pgoutput", what are the walsenders doing? You can try "strace" and
use "gdb" to break into the walsenders and take a stack trace.
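Something along these lines might do as a starting point. It is only a rough sketch: the helper names "run" and "diagnose_walsender", the DRY_RUN switch and the 5-second strace window are made up for illustration and are not part of any PostgreSQL tooling. By default it only prints the commands it would run; attaching for real requires root or the postgres operating system user, and the backtrace is only readable if debug symbols are installed.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN and the 5 second window are assumptions.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Collect system call activity and a stack trace from one walsender PID.
diagnose_walsender() {
    pid=$1
    # Watch system calls for 5 seconds, then let timeout stop strace.
    # Little or no output while the process burns CPU points to a
    # loop in user space rather than in the kernel.
    run timeout 5 strace -p "$pid"
    # Attach, print a backtrace, and detach again without a prompt.
    run gdb -p "$pid" -batch -ex bt
}
```

To run it for real against every walsender, you could set DRY_RUN=0 and call diagnose_walsender once for each PID that "ps -Alf|grep postgres" shows with "walsender" in the process title. Taking two or three backtraces a few seconds apart helps to tell whether the process is stuck in one place or going around a loop.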
Yours,
Laurenz Albe