Re: Hash id in pg_stat_statements

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash id in pg_stat_statements
Date: 2012-10-02 13:15:46
Message-ID: CAEYLb_UYZtizMCD0g+2txhC7vdeGA9-9XjCs-f=gPvgx0_+5aA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1 October 2012 18:05, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Peter Geoghegan (peter(at)2ndquadrant(dot)com) wrote:
>> That won't really help matters. There'd still be duplicate entries,
>> from before and after the change, even if we make it immediately
>> obvious which is which. The only reasonable solution in that scenario
>> is to bump PGSS_FILE_HEADER, which will cause all existing entries to
>> be invalidated.
>
> You're going to have to help me here, 'cause I don't see how there can
> be duplicates if we include the PGSS_FILE_HEADER as part of the hash,
> unless we're planning to keep PGSS_FILE_HEADER constant while we change
> what the hash value is for a given query, yet that goes against the
> assumptions that were laid out, aiui.

Well, they wouldn't be duplicates if you think that the fact that one
query was executed before some point release and another after ought
to differentiate queries. I do not.

> If there's a change that results in a given query X no longer hashing to
> a value A, we need to change PGSS_FILE_HEADER to invalidate statistics
> which were collected for value A (or else we risk an independent query Y
> hashing to value A and ending up with completely invalid stats..).
> Provided we apply that consistently and don't reuse PGSS_FILE_HEADER
> values along the way, a combination of PGSS_FILE_HEADER and the hash
> value for a given query should be unique over time.
>
> We do need to document that the hash value for a given query may
> change..

By invalidate, I mean that when we go to open the saved file, if the
header doesn't match, the file is considered corrupt, and we simply
log that the file could not be read, before unlinking it. This would
be necessary in the unlikely event of there being some substantive
change in the representation of query trees in a point release. I am
not aware of any precedent for this, though Tom said that there was
one.

Tom: Could you please describe the precedent you had in mind? I would
like to better understand this risk.

I don't want to get too hung up on what we'd do if this problem
actually occurred, because that isn't what this thread is about.
Exposing the hash is a completely unrelated question, except that to
do so might make pg_stat_statements (including its limitations) better
understood. In my opinion, the various log parsing tools have taught
people to think about this in the wrong way - the query string is just
a representation of a query (and an imperfect one at that).

There are other, similar tools that exist in proprietary databases.
They expose a hash value, which is subject to exactly the same caveats
as our own. They explicitly encourage the type of aggregation by
third-party tools that I anticipate will happen with
pg_stat_statements. I want to facilitate this, but I'm confident that
the use of (dbid, userid, querytext) as a "candidate key" by these
tools is going to cause confusion, in subtle ways. By using the hash
value, those tools are exactly as well off as pg_stat_statements is,
which is to say, as well off as possible.

I simply do not understand objections to the proposal. Have I missed something?

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Scott Corscadden 2012-10-02 13:41:50 MVCC and large objects
Previous Message Pavel Stehule 2012-10-02 11:09:07 Re: ToDo: allow to get a number of processed rows by COPY statement [Review of Patch]