Re: Hash id in pg_stat_statements

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash id in pg_stat_statements
Date: 2012-11-15 13:10:23
Message-ID: CABUevEwW0iQs1kY8qfh6h=01AHWGeemEn=Cr_Wc6+PEKf6bS5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 2, 2012 at 8:22 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> On 2 October 2012 18:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>> 1. Why isn't something like md5() on the reported query text an equally
>> good solution for users who want a query hash?
>
> Because that does not uniquely identify the entry. The very first
> thing that the docs say on search_path is "Qualified names are tedious
> to write, and it's often best not to wire a particular schema name
> into applications anyway". Presumably, the reason it's best not to
> wire schema names into apps is because it might be useful to modify
> search_path in a way that dynamically made the same queries in some
> application reference what are technically distinct relations. If
> anyone does this, and it seems likely that many do for various
> reasons, they will be out of luck when using some kind of
> pg_stat_statements aggregation.
>
> This was the behaviour that I intended for pg_stat_statements all
> along, and I think it's better than a solution that matches query
> strings.
>
>> 2. If people are going to accumulate stats on queries over a long period
>> of time, is a 32-bit hash really good enough for the purpose? If I'm
>> doing the math right, the chance of collision is already greater than 1%
>> at 10000 queries, and rises to about 70% for 100000 queries; see
>> http://en.wikipedia.org/wiki/Birthday_paradox
>> We discussed this issue and decided it was okay for pg_stat_statements's
>> internal hash table, but it's not at all clear to me that it's sensible
>> to use 32-bit hashes for external accumulation of query stats.
>
> Well, forgive me for pointing this out, but I did propose that the
> hash be a 64-bit value (which would have necessitated adopting
> hash_any() to produce 64-bit values), but you rejected the proposal. I
> arrived at the same probability for a collision as you did and posted
> in to the list, in discussion shortly after the normalisation stuff
> was committed.

What was the argument for rejecting it? Just that it required
hash_any() to be adapted?

Now that we have a very clear use case where this would help, perhaps
it's time to re-visit this proposal?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-11-15 13:28:46 Re: Hash id in pg_stat_statements
Previous Message Amit Kapila 2012-11-15 12:55:37 Re: Switching timeline over streaming replication