Re: Add min and max execute statement time in pg_stat_statement

From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Geoghegan <pg(at)heroku(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add min and max execute statement time in pg_stat_statement
Date: 2013-10-22 09:56:47
Message-ID: m2eh7dr5ls.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Hm. It's been a long time since college statistics, but doesn't the
> entire concept of standard deviation depend on the assumption that the
> underlying distribution is more-or-less normal (Gaussian)? Is there a

I just had a quick chat with a statistician friends of mine on that
topic, and it seems that the only way to make sense of an average is if
you know already the distribution.

In our case, what I keep experiencing with tuning queries is that we
have like 99% of them running under acceptable threshold and 1% of them
taking more and more time.

In a normal (Gaussian) distribution, there would be no query time
farther away from the average than any other, so my experience tells me
that the query time distribution is anything BUT normal (Gaussian).

> good reason to suppose that query runtime is Gaussian? (I'd bet not;
> in particular, multimodal behavior seems very likely due to things like
> plan changes.) If not, how much does that affect the usefulness of
> a standard-deviation calculation?

I don't know what multi-modal is.

What I've been gathering from my quick chat this morning is that either
you know how to characterize the distribution and then the min max and
average are useful on their own, or you need to keep track of an
histogram where all the bins are of the same size to be able to learn
what the distribution actually is.

We didn't get to the point where I could understand if storing histogram
with a constant size on log10 of the data rather than the data itself is
going to allow us to properly characterize the distribution.

The main question I want to answer here would be the percentiles one, I
want to get the query max execution timing for 95% of the executions,
then 99%, then 99.9% etc. There's no way to answer that without knowing
the distribution shape, so we need enough stats to learn what the
distribution shape is (hence, histograms).

Of course keeping enough stats seems to always begin with keeping the
min, max and average, so we can just begin there. We would just be
unable to answer interesting questions with just that.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Farina 2013-10-22 10:11:14 Re: Add min and max execute statement time in pg_stat_statement
Previous Message Albe Laurenz 2013-10-22 09:39:07 Re: Getting Tuples list from an index