Re: [Proposal] Add accumulated statistics for wait event

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: imai(dot)yoshikazu(at)fujitsu(dot)com, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "yoshikazu_i443(at)live(dot)jp" <yoshikazu_i443(at)live(dot)jp>
Subject: Re: [Proposal] Add accumulated statistics for wait event
Date: 2021-06-04 22:53:44
Message-ID: 20210605005344.4a751fe5@firost
Lists: pgsql-hackers

Hi All,

I have faced a few times a situation where a long-running query actually
includes the time the backend spends waiting for the frontend to fetch all the
rows (see [1] for details). See the sample code fe-time.c and its comments in
attachment to reproduce this behavior.
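
If you prefer not to open the attachment, here is a minimal sketch of such a
slow-fetching client (this is NOT the attached fe-time.c, just an
illustration; the query, the sleep values and the build line are assumptions
of mine): it pulls the rows in single-row mode and sleeps every few thousand
rows, so the socket buffers fill up and the backend ends up blocked in
ClientWrite.

/* slow_fetch.c -- illustrative sketch only, NOT the attached fe-time.c.
 * Fetch rows one at a time and sleep periodically so the backend fills
 * its socket buffer and waits in ClientWrite.
 * Assumed build line: cc slow_fetch.c -I$(pg_config --includedir) \
 *                        -L$(pg_config --libdir) -lpq -o slow_fetch
 */
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int main(void)
{
    /* connection parameters come from the usual PG* environment variables */
    PGconn *conn = PQconnectdb("");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    if (!PQsendQuery(conn, "SELECT * FROM pgbench_accounts"))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return 1;
    }
    PQsetSingleRowMode(conn);       /* pull rows one by one */

    PGresult *res;
    long rows = 0;

    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
        {
            rows++;
            if (rows % 10000 == 0)
                usleep(100 * 1000); /* simulate a slow consumer: 100ms pause */
        }
        PQclear(res);
    }

    printf("fetched %ld rows\n", rows);
    PQfinish(conn);
    return 0;
}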

There's no simple way today to pinpoint the problem in production without
advanced interactive auditing and/or system tools. After studying the problem,
I believe it boils down to tracking the ClientWrite wait event, so I ended up
on this thread.

You might also catch such mismatching times thanks to auto_explain. Running
the fe-time.c demo code with the following command:

PGDATABASE=postgres PGHOST=::1 time ./fe-time 100

the frontend wall-clock time is 10s, the reported query duration is
3228.631ms, but the last row was produced after only 20.672ms:

LOG: duration: 3228.631 ms plan:
Query Text: SELECT * FROM pgbench_accounts
Seq Scan on pgbench_accounts (time=0.005..20.672 rows=100000 loops=1)

(Note that, in contrast with localhost, when going through the Unix socket the
backend-reported query time is always really close to 10s.)

I rebased the existing patch (see attachment) to look at the ClientWrite wait
event for this exact query:

# SELECT wait_event, calls, times
  FROM pg_stat_get_waitaccum(NULL)
  WHERE wait_event = 'ClientWrite';

 wait_event  | calls |  times
-------------+-------+---------
 ClientWrite |     4 | 3132266

The "time" is expressed as µs in the patch, so 3132.266ms of the total
3228.631ms is spent sending the result to the frontend. I'm not sure where are
the missing 75ms.

The pg_wait_sampling extension might help, but it requires a restart of the
production instance to install and enable it. Whether the solution is sampling
or cumulative, an in-core and hot-switchable solution would be much more
convenient. Anyway, looking at pg_wait_sampling, we have a clear suspect as
well for a later run of the same query:

# SELECT event, count
  FROM pg_wait_sampling_profile
  WHERE queryid=4045741516911800313;

    event    | count
-------------+-------
 ClientWrite |   309

The default profile period of pg_wait_sampling being 10ms, we can roughly
estimate the ClientWrite time at around 3090ms. This estimate is close enough,
since we know the 3132266µs measured above were accumulated over only 4 long
wait events.

I finish answering below.

On Mon, 3 Aug 2020 00:00:40 +0200
Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:

> > On 31 Jul 2020, at 07:23, imai(dot)yoshikazu(at)fujitsu(dot)com wrote:
> >
> >> This patch fails to apply to HEAD, please submit a rebased version. I've
> >> marked this as as Waiting on Author.

Please find in attachment a rebase of both patches. I did some small editing
along the way. I didn't benchmark them.

I'm not sure this patch is the best approach though. Take it as a motivation
to keep this discussion going. As I wrote, whether the solution is sampling or
cumulative, an in-core and hot-switchable solution would be much more
convenient. The fact is that this patch was already available, which made it a
convenient starting point for the discussion.

Collecting and summing all wait events from all backends in the same place
makes it impossible to track wait events from a specific backend precisely,
especially on a busy system where the numbers are quickly buried by all the
surrounding activity.

I wonder if wait events should only be accumulated on the backend side, making
it possible to enable/disable them on the fly and to collect reports, e.g. in
the logs or in some output. Most of the code from these patches could be
recycled into a simpler patch implementing this.
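
To illustrate the idea, here is a rough standalone sketch (plain C, not
PostgreSQL code, and all the names below are made up): calls and elapsed time
are accumulated per wait event in backend-local counters, behind a flag that
could be flipped on the fly, so the clock reads are only paid when tracking is
enabled.

/* waitaccum_sketch.c -- standalone illustration only, not PostgreSQL code.
 * Per-backend (here: per-process) accumulation of wait event durations,
 * switchable at run time.  In a real patch the counters would live in
 * backend-local memory and the switch would be a GUC; everything below
 * is hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define WE_CLIENT_WRITE 0
#define WE_CLIENT_READ  1
#define WE_MAX          2

static bool     track_wait_times = true;    /* the on-the-fly switch */
static uint64_t wait_calls[WE_MAX];
static uint64_t wait_us[WE_MAX];

static uint64_t
now_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Wrap a blocking call: the clock is only read when tracking is enabled. */
static void
record_wait(int event, void (*blocking_call)(void))
{
    uint64_t    start;

    if (!track_wait_times)
    {
        blocking_call();
        return;
    }
    start = now_us();
    blocking_call();
    wait_calls[event]++;
    wait_us[event] += now_us() - start;
}

/* Stand-in for a send() blocking on a slow frontend. */
static void
fake_client_write(void)
{
    struct timespec pause = {0, 50 * 1000 * 1000};  /* 50ms */

    nanosleep(&pause, NULL);
}

int
main(void)
{
    int         i;

    for (i = 0; i < 4; i++)
        record_wait(WE_CLIENT_WRITE, fake_client_write);

    /* a per-backend report, e.g. dumped to the logs on demand */
    printf("ClientWrite: calls=%llu times=%lluus\n",
           (unsigned long long) wait_calls[WE_CLIENT_WRITE],
           (unsigned long long) wait_us[WE_CLIENT_WRITE]);
    return 0;
}

The point being that the counters stay local to one backend, so a single
session can be inspected without being drowned in the activity of the whole
instance.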

Thoughts?

> > Sorry for my absence. Unfortunately I couldn't have time to work on this
> > patch in this cf. I believe I will be back in next cf, work on this patch
> > and also review other patches.
>
> No worries, it happens. Since the thread has stalled and there is no updated
> patch I've marked this entry Returned with Feedback. Please feel free to
> re-open a new CF entry if you return to this patch.

I volunteer to be a reviewer on this patch.

Imai-san, would you agree to add it as a new CF entry?

Regards,

[1]: The last time I faced such a situation was a few weeks ago. A customer
was reporting a query being randomly slow, running below 100ms most of the
time and sometimes hitting 28s. Long story short, the number of rows was the
same (10-15k), but the result set size was 9x bigger (1MB vs 9MB). As the same
query was running fine from psql, I suspected the frontend was somehow
saturated. Tcpdump helped me compute that the throughput fell to 256kB/s after
the first 2MB of data transferred, with a very narrow TCP window. I explained
to the customer that their app probably doesn't pull the rows fast enough and
that some buffers were probably saturated on the frontend side, waiting for
the app and slowing down the whole transfer.
The developers fixed the problem by moving two field transformations
(unaccent) out of their row-fetching loop.

Attachment Content-Type Size
0001-Add-pg_stat_waitaccum-view-v7.patch text/x-patch 32.3 KB
0002-POC-Change-measuring-method-of-wait-event-time-from-v7.patch text/x-patch 4.8 KB
fe-time.c text/x-c++src 1.8 KB
