Extending pg_stat_statements to expose queryid

From: Sameer Thakur <samthakur74(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Extending pg_stat_statements to expose queryid
Date: 2013-09-05 09:44:55
Message-ID: CABzZFEv_XaXu_4T8m3SCSvxSg_wcHB9RXOJrg5kmRpiiRrwYTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello All,

I am trying to revive the discussion about exposing queryid in
pg_stat_statements.

I did find the same request posted on hackers @
http://www.postgresql.org/message-id/CABUevExLnb6xJwS=8rTaLOfDOS-tFm09==Z2M_vz5hhFkgWBTg@mail.gmail.com

and

http://www.postgresql.org/message-id/CACN56+NLMTwHg8eQQqNYzqe2Q0nEGJoKmGFiUSK_aoHw627Q8Q@mail.gmail.com

From the discussions I concluded

1. The main use case for exposing queryid, is it being a better substitute
to hashing the query text of a pg_stat_statements snapshot, to make a
candidate key. Problems occur when hash value should be different even if
query text is same. For example when a table referred in a query is dropped
and recreated or when the query text is same on different schemas and
schema name is not included in query text.

2. Exposing queryid was proposed earlier but was not accepted. The main
reason was that queryid could be unstable as well. Since queryid was
derived from hashing query tree and query tree could undergo changes
between minor PostgreSQL releases, meant the queryid for same query could
be different between releases, resulting in incorrect statement statistics
collection.

3. Another problem is to distinguish between queries whose statistics are
continuously maintained and queries which are intermittent, whose
statistics might be silent reset, without the reporting tool being wiser.

4. A solution to avoid misrepresentation of intermittent queries as
consistent queries would be to assign a unique number to each new row and
once that row is discarded, the unique number cannot be reused. The
drawbacks here is possible collision of unique values generated.

5. A patch implementing solution for identifying intermittent query is @
https://github.com/fdr/postgres/branches/error-prop-pg_stat_statements-v2.

The solution avoids using a counter, and achieves the same result by the
property that intermittent queries accumulate errors due to eviction from
hashtable while consistent queries do not. Error accumulation would be the
parameter by which a reporting tool can figure out if there was eviction of
queries between snapshots.

6. To address the problem of unstable queryid generated from query tree, it
was proposed to eliminate any possible misunderstanding that queryid will
remain the same between releases, by xoring the hash from query tree with
statistics sessionid. This also helps in all cases where the statistics
file is reset like crash recovery,recovery mode, ensuring a new hash value
for reset statistics.

To avoid increasing the chance of collision, a longer session key and
padding the queryid can be done to complete the XOR. Implementation of this
is @
https://github.com/fdr/postgres/branches/pg_stat_statements-identification-v3

7. The patch pg_stat_statements-identification-v3 was returned with
feedback for more documentation in commitfest 2013-01.

Questions:

1. Is there a plan to re-introduce this patch? The code seems to be
documented.

2. There was mention of further testing of error propagation using hooks.
Could this be elaborated?

3. There was a use case that exposing queryid could be used to aggregate
statistics across WAL based replication clusters. But now that queryid is
derived from statistics session id, which is randomly generated, this use
case is still not addressed. Is this correct?

Regards

Sameer

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2013-09-05 10:29:23 Re: Analysis on backend-private memory usage (and a patch)
Previous Message Pavel Stehule 2013-09-05 08:42:29 Re: Improving avg performance for numeric