Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, y sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections
Date: 2022-07-07 18:58:06
Message-ID: 20220707185806.GC13040@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 30, 2022 at 11:05:28PM +0300, Michael Zhilin wrote:
> I would like to submit patch for column wait_event of view pg_stat_activity.
> Please find it attached.

Thanks. I had this thread on my list to look at later but then, by chance, I
got an erroneous nagios notification about a longrunning transaction, which
seems to have been caused by this bug.

You can reproduce the problem by running a couple handsful of these:

psql -h /tmp postgres -Atc "SELECT * FROM pg_stat_activity WHERE state='active' AND wait_event='ClientRead' LIMIT 1"&

> As result backend wait events may be changed for quick queries.

Yep. I realize now that I've been seeing this at one of our customers for a
pkey lookup query. We don't use high max_connections, but if you have a lot of
connections, you might be more likely to hit this.

I agree that this is a bug, since it can (and did) cause false positives in a
monitoring system.

> As well, patch changes way to allocate memory for local structure. Before it
> estimates maximum size of required memory and allocate it at once. It could
> result into allocation of dozens/hundreds of megabytes for nothing. Now it
> allocates memory by chunks to reduce overall amount of allocated memory and
> reduce time for allocation.

I suggest to present this as two patches: a 0001 patch to fix the bug, and
proposed for backpatch, and an 0002 patch for master to improve memory usage.
As attached. Actually, once 0001 is resolved, it may be good to start a
separate thread for 0002. I plan to add to the next CF.

Did you really experience latency before the patch ? It seems like
max_connections=1000 would only allocate a bit more than 1MB.

For me, max_connections=10000 makes pg_stat_activity 10% slower for me than 100
(11.8s vs 13.2s). With the patch, 10k is only about ~3% slower than 100. So
there is an improvement, but people here may not be enthusiastic excited about
improving performance for huge values of max_connections.

time for a in `seq 1 99`; do psql -h /tmp postgres -Atc "SELECT FROM pg_stat_activity -- WHERE state='active' AND wait_event='ClientRead' LIMIT 1"; done

Is there a reason why you made separate allocations for appname, hostname,
activity, ssl, gss (I know that's how it's done originally, too) ? Couldn't it
be a single allocation ? You could do a single (re)allocation when an active
backend is found if the amount of free space is less than
2*NAMEDATALEN+track_activity_query_size+sizeof(PgBackendSSLStatus)+sizeof(PgBackendGSSStatus)
I have a patch for that, but I'll share it after 0001 is resolved.

Why did you change backendStatus to a pointer ? Is it to minimize the size of
the structure to allow for huge values of max_connections ?

Note that there were warnings from your 0002:
backend_status.c:723:21: warning: ‘localactivity_thr’ may be used uninitialized in this function [-Wmaybe-uninitialized]

Thanks,
--
Justin

Attachment Content-Type Size
0001-fix-column-wait_event-of-view-pg_stat_activity.patch text/x-diff 4.9 KB
0002-wip-pg_stat_activity-allocate-memory-differently.patch text/x-diff 9.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-07-07 19:00:09 Re: pg_parameter_aclcheck() and trusted extensions
Previous Message Dmitry Koval 2022-07-07 18:44:35 Re: enable/disable broken for statement triggers on partitioned tables