Normalization of utility queries in pg_stat_statements

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Normalization of utility queries in pg_stat_statements
Date: 2023-02-08 03:05:24
Message-ID: Y+MRdEq9W9XVa2AB@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,
(Adding Bertrand in CC.)

$subject is a follow-up of the automation of query jumbling for
utilities and DDLs, and attached is a set of patches that apply
normalization to DDL queries across the board, for all utilities.

This relies on tracking the location of A_Const nodes while removing
from the query jumbling computation the values attached to the node,
as as utility queries can show be stored as normalized in
pg_stat_statements with some $N parameters. The main case behind
doing that is of course monitoring, where we have seen some user
instances willing to get more information but see pg_stat_statements
as a bottleneck because the query ID of utility queries are based on
the computation of their string, and is value-sensitive. That's the
case mentioned by Bertrand Drouvot for CALL and SET where workloads
full of these easily bloat pg_stat_statements, where we concluded
about more automation in this area (so here it is):
https://www.postgresql.org/message-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com

For example, this makes possible the following grouping:
- CALL func(1,2); CALL func(1,3); => CALL func($1,$2)
- EXPLAIN SELECT 1; EXPLAIN SELECT 1; => EXPLAIN SELECT $1;
- CREATE MATERIALIZED VIEW aam AS SELECT 1; becomes "CREATE
MATERIALIZED VIEW aam AS SELECT $1".

Query jumbling for DDLs and utilities happens now automatically, still
are not represented correctly in pg_stat_statements (one bit of
documentation I missed previously refers to the fact that these depend
on their query strings, which is not the case yet).

By the way, while looking at all that, I have really underestimated
the use of Const nodes in utilities, as some queries can finish with
the same query ID even if different values are stored in a query,
still don't show up as normalized in pg_stat_statements, so the
current state of HEAD is not good, though you would need to use the
same object name to a conflict for most of them. So that's my mistake
here with 3db72eb. If folks think that we'd better have a revert of
this automated query jumbling facility based on this argument, that
would be fine for me, as well. The main case I have noticed in this
area is EXPLAIN, by the way. Note that it is actually easy to move to
the ~15 approach of having a query ID depending on the Const node
values for DDLs, by having a custom implementation in
queryjumblefuncs.c for Const nodes, where we apply the constant value
and don't store a location for normalization if a query has a utility
once this information is stored in a JumbleState.

This rule influences various DDLs, as well, once it gets applied
across the board, and it's been some work to identify all of them, but
I think that I have caught them all as the regression database offers
all the possible patterns:
- CREATE VIEW, CTAS, CREATE MATERIALIZED VIEW which have Const nodes
depending on their attached queries, for various clauses.
- ALTER TABLE/INDEX/FOREIGN with DEFAULT, SET components.
- CREATE TABLE with partition bounds.
- BEGIN and ABORT, with transaction commands getting grouped
together.

The attached patch set includes as a set of regression tests for
pg_stat_statements for *all* the utility queries that have either
Const or A_Const nodes, so as one can see the effect that all this
stuff has. This is based on a diff of the contents of
pg_stat_statements on the regression database once all these
normalization rules are applied.

Compilation of a Const can also be made depending on the type node.
However, all that makes no sense if applying the same normalization
rules to all the queries across the board, because all the queries
would follow the same rules. That's the critical bit IMO. From what
I get, the bloat of pg_stat_statements for all utilities is something
that would be helpful for all such queries, still different things
could be done on a per-node basis. Perhaps this is too aggressive as
it is and people don't like it, though, so feedback is welcome. I'd
like to think that maximizing grouping is nice though, because it
leads to no actual loss of information on the workload pattern for the
queries involved, AFAIU. This sentence may be overoptimistic.

So, attached is a patch set, that does the following:
- 0001 is a refactoring of the regression tests of
pg_stat_statements by splitting a bit the tests. I bumped into that
while getting confused at how the tests are now when it comes to the
handling of utilities and track_planning, where these tests silently
rely on other parts of the same file with different GUC settings.
This refactoring is useful on its own, IMO, and the tests show the
same output as previously.
- 0002 is the addition of tests in pg_stat_statements for all the DDL
and utility patterns that make use of Const and A_Const nodes. Even
if query jumbling of utilities is done through their text string or
their nodes, this is also useful.
- 0003 is the code of the feature, that switches pg_stat_statements to
properly normalize utility queries, with a modification to A_Const so
as normalization can be applied to it. With the generation of the
code for query jumbling being automated based on the node definitions,
this is straight-forward as a code change, but the changes are
basically impossible to track without all the patterns tracked by
0002.

Thoughts and comments are welcome. 0001 and 0002 are useful on their
own to keep track of utilities that use Const and A_Const after going
through the query jumbling, even if an approach based on query string
or the automated query jumbling for utilities is used (the query
string approach a bit its value). I'll add that to the next commit
fest.

Thanks,
--
Michael

Attachment Content-Type Size
0001-Refactor-regression-tests-of-pg_stat_statements.patch text/x-diff 31.4 KB
0002-Add-more-test-for-utility-queries-in-pg_stat_stateme.patch text/x-diff 30.1 KB
0003-Apply-normalization-to-A_Const-and-utilities-in-pg_s.patch text/x-diff 20.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2023-02-08 03:26:01 Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Previous Message wangw.fnst@fujitsu.com 2023-02-08 03:01:03 RE: Perform streaming logical transactions by background workers and parallel apply