Re: [PATCH] optional cleaning queries stored in pg_stat_statements

From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Peter Geoghegan" <peter(at)2ndquadrant(dot)com>, "Tomas Vondra" <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] optional cleaning queries stored in pg_stat_statements
Date: 2011-11-06 15:57:50
Message-ID: 214eea0a7c532feeb959b7ae7169d1d7.squirrel@sq.gransy.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6 Listopad 2011, 16:08, Tom Lane wrote:
> Peter Geoghegan <peter(at)2ndquadrant(dot)com> writes:
>> I'm a couple of days away from posting a much better principled
>> implementation of pg_stat_statements normalisation. To normalise, we
>> perform a selective serialisation of the query tree, which is hashed.
>
> That seems like an interesting approach, and definitely a lot less of
> a kluge than what Tomas suggested. Are you intending to hash the raw
> grammar output tree, the parse analysis result, the rewriter result,
> or the actual plan tree? I don't necessarily have a preconceived notion
> about which is best (I can think of pluses and minuses for each), but
> we should hear your explanation of which one you chose and what your
> reasoning was.

Could you describe the pluses and minuses? My understanding is that the
later the hash is computed, the more it reflects how the queries were
actually executed. Would it make sense to turn this into a GUC and leave
the decision up to the user, something like

pg_stat_statements.hash_phase = {grammar|analysis|rewriter|plan}

So that the user could decide how coarse the output should be?

> I'm not real sure whether it's better to classify on the basis of
> similar plans or similar original queries, anyway. This seems like
> something that could be open for debate about use-cases.

Well, that's really tricky question - there can be different queries with
the same plan. I thing that grouping queries solely by the plan is not
much useful, so the original query should be involved somehow.

What about using two hashes - hash of the grammar tree (grammar_hash) and
hash of the rewriter output (rewriter_hash). The pg_stat_statements would
then group the queries by the (grammar_hash, rewriter_hash) pair and
include those two columns into the output.

So I could select the rows with the same grammar_hash to see observed
plans for the given query, or select rows with the same rewriter_hash to
see queries leading to that particular plan.

To make this actually usable it's important to provide access to the
plans, so that the user can get rewriter_hash and get the plan somehow.
This is not needed for grammar_hash, because the query string will be
there, but the actual plan might change (due to autoanalyze etc.).

But maybe this is a severe over-engineering and it's far too complicated.

> It might be that the path of least resistance is to document that we
> select one of the actual query strings "at random" to represent all the
> queries that went into the same hash entry, and not even bother with
> trying to strip out constants. The effort required to do that seems
> well out of proportion to the reward, if we can't do a perfect job of
> representing the common aspects of the queries.

Yes, once we have the hashes we can surely use a random query string with
the values included. But it'd be nice to have the actual plans stored
somewhere, so that it's possible to see them later.

Tomas

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-11-06 16:54:37 Re: Strange problem with create table as select * from table;
Previous Message Tom Lane 2011-11-06 15:08:21 Re: [PATCH] optional cleaning queries stored in pg_stat_statements