Troubleshooting a brief spike in active sessions

From: "Dirschel, Steve-CW" <Steve(dot)Dirschel(at)bestbuy(dot)com>
To: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Troubleshooting a brief spike in active sessions
Date: 2026-01-09 21:41:33
Message-ID: BL0PR06MB5009A7B383ACA34CD3603B058A82A@BL0PR06MB5009.namprd06.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Aurora Postgres version 17.4

This application mainly runs 3 queries very frequently- around 12000 executions per second across the 3. Normally the server handles this fine and at any given time there are between 0 - 10 active sessions running the queries. The 3 queries are tuned- doing 3-7 logical reads per execute and run in < .1 millisecond.

Periodically there will be a huge spike in active sessions up to 400+ sessions all on CPU (it can be higher depending on how many app servers are in the mix at the time). This spike will last 2-5 seconds but it causes an increase in app response time which is a problem.

We snap info from pg_stat_activity (where state <> ‘idle' ) every 1 second for troubleshooting issues and a couple observations from that snap data when this spike happens:

1.
The app sessions will show as wait_event_type and wait_event as null which I understand means they are on CPU.
2.
Some of those sessions will not have queryid populated. But all the app sessions do is execute queries.
3.
With aurora Postgres they store plan information in pg_stat_activity in planid column. For some sessions the queryid will be populated but the planid will not be populated. According to AWS this can happen if the session has completed parsing the query but has not finished plan generation or sessions are in the bind phase where parameter binding occurs before final plan selection.

With Aurora Postgres we do not have access to the server to look at the unix processes but we enabled enhanced monitoring in 1 second intervals which does provide a little info on session processes. One thing that stood out when the active session spike happens if I looked at the process list (it only shows ~ 95 of the processes) I would see a bunch like this:

"vss": 640900784,
"name": "postgres: prcoffrcnsmruser prcoffrcnsmr 172.29.145.159(53026) BIND",
"tgid": 26247,
"parentID": 584,
"memoryUsedPc": 0.02,
"cpuUsedPc": 0.06,
"id": 26247,
"rss": 30496,
"vmlimit": “unlimited"

Where they had BIND next to the process info (vs others will show things like idle or SELECT). In this case of the 95 processes shown it may have shown 59 of them were idle, 2 were SELECT, 2 were PARSE, and 32 were BIND.

So based on these processes showing BIND and pg_stat_activity either not showing a queryid or planid if it did show a queryid it seems these sessions may be getting hung up parsing the queries or specifically at the step it does binding.

Can anyone come up with any theories what may be going on here or is my description/explanation above too vague? The app is using prepared statements. The connection pool is HikariCP which does not support caching prepared statements. The thing that throws me off is so far I’m unable to see anything the app is doing to cause this. I don’t see a spike in query activity. I don’t see a spike in new connections from the pools. I don’t see anything leading up to this other than this spike in sessions out of nowhere. I don’t see any resource issues on the server. It will run this same load consistently throughout the day and the active session spike may happen 0 - 10 times in a day.

And ideas would be appreciated.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Joshua Banton 2026-01-09 21:58:57 Re: Troubleshooting a brief spike in active sessions
Previous Message Pavel Stehule 2026-01-09 08:45:45 Re: proposal: schema variables