Re: [Proposal] Add accumulated statistics for wait event

From: Bertrand DROUVOT <bdrouvot(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Phil Florent <philflorent(at)hotmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [Proposal] Add accumulated statistics for wait event
Date: 2018-07-26 10:05:33
Message-ID: CAGeEyRVBq=zrOY3sZihOet5ey7XHshEnZODL-244=65xwHm=3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Guys,

As you mentioned Oracle like active session history sampling in this
thread, I just want to let you know that I am working on a brand new
extension to provide this feature.

You can find the extension here: https://github.com/pgsentinel/pgsentinel

Basically, you could see it as samplings of pg_stat_activity (one second
interval as default) currently providing more information:

ash_time: the sampling time
blockers: the number of blockers
blockerpid: the pid of the blocker (if blockers = 1), the pid of
one blocker (if blockers > 1)
top_level_query: the top level statement (in case PL/pgSQL is used)
query: the statement being executed (not normalised, as it is in
pg_stat_statements, means you see the values)
cmdtype: the statement type
(SELECT,UPDATE,INSERT,DELETE,UTILITY,UNKNOWN,NOTHING)
queryid: the queryid of the statement (the one coming from
pg_stat_statements)

Thanks to the queryid field you are able to link the session activity with
the sql activity.

It's implemented as in-memory ring buffer where samples are written with
given (configurable) period.
Therefore, user can see some number of recent samples depending on history
size (configurable).

Current caveats: In case of high query rate per pid, you could see (I saw
it at more than 1000 queries per second) top_level_query and query not
"correlated" (query, queryid and cmdtype are still well linked together).
This is due to the fact that those 2 informations are currently collected
independently.

If you want to have a look, give your thoughts, you are welcome.

Bertrand

On 26 July 2018 at 03:24, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Tue, Jul 24, 2018 at 04:23:03PM +0000, Phil Florent wrote:
> > It loses non meaningful details and it's in fact a good point. In this
> > example, sampling will definitely find the cause and won't cost
> > resources.
>
> The higher the sampling frequency, the more details you get, with the
> most load on the instance. So if you are able to take an infinity of
> samples, where registering multiple times the same event for the same
> backend also matters because its overall weight gets higher and it shows
> up higher in profiles, then you would be able converge to the set of
> results that this patch adds. Sampling method, especially its
> frequency, is something controlled by the client and not the server.
> Approaches like the one proposed here push the load on the server-side,
> unconditionally, for *all* backends, and this has its cost.
>
> Even if you have spiky workloads, sampling may miss those, but even with
> adding counters for each event you would need to query the table holding
> the counters at an insane frequency to be able to perhaps get something
> out of it as you need to do sampling of the counters as well to extract
> deltas.
>
> As Tomas has mentioned up-thread, sampling is light-weight, as-is the
> current design for wait events. Even if it is not perfect because it
> cannot give exact numbers, it would find bottlenecks in really most
> cases, and that's what matters. If not, increasing the sampling
> frequency makes things easier to detect as well. What would be the
> point of taking only one sample every checkpoint for example?
>
> There may be a benefit in having counters, I don't know the answer to
> that, though the point would be to make sure that there is a specific
> set of workloads where it makes sense, still my gut feeling is that
> sampling would be able to detect those anyway.
>
> (I am not a computer scientist by default but a physicist, think fluid
> dynamics and turbulence, and I had my load of random events and signal
> analysis as well. All that is just statistics with attempts to approach
> reality, where sampling is a life-saver over exactitude of
> measurements.)
>
> Adding hooks is not acceptable to me either, those have a cost, and it
> is not clear what's the benefit we can get into putting hooks in such a
> place for cases other than sampling and counters...
> --
> Michael
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rajkumar Raghuwanshi 2018-07-26 10:10:01 negative bitmapset member not allowed Error with partition pruning
Previous Message Kyotaro HORIGUCHI 2018-07-26 10:05:11 Re: Temporary tables prevent autovacuum, leading to XID wraparound