Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)
Date: 2015-11-24 07:53:16
Message-ID: CAM3SWZSpdPB3uErnXWMt3q74y0r+84ZsOt2U3HKKes_V7O+0Qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 1:30 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> pg_stat_statements' fingerprinting logic considers the following two
> statements as distinct:
>
> select 1 in (1, 2, 3);
> select 1 in (1, 2, 3, 4);
>
> This is because the ArrayExpr jumble case jumbles any ArrayExpr's list
> of elements recursively. In this case it's a list of Const nodes, and
> the fingerprinting logic jumbles those nodes indifferently.
>
> Somebody told me that they think that pg_stat_statements should not do
> that. This person felt that it would be preferable for such
> expressions to be normalized without regard to the number of distinct
> Const elements. I suppose that that would work by determing if the
> ArrayExpr elements list was a list of Const nodes and only const
> nodes. Iff that turned out to be the case, something else would be
> jumbled (something other than the list) that would essentially be a
> representation of "some list of zero or more (or maybe one or more)
> Const nodes with consttype of, in this example, 23". I think that this
> would make at least one person happy, because of course the two
> statements above would have their costs aggregated within a single
> pg_stat_statements entry.

Baron Schwartz recently gave a talk at PGConf Silicon Valley about the
proprietary query instrumentation tool, VividCortex. The slides are
available from:

http://info.citusdata.com/rs/235-CNE-301/images/Analyzing_PostgreSQL_Network_Traffic_with_vc-pgsql-sniffer_-_Baron_Schwartz.pdf

One specific justification he gave for not using pg_stat_statements was:

"Doesn’t merge bind vars in IN()" (See slide #11)

His theory is that you should allow a proprietary binary to run with
root permissions, a binary that sniffs the wire protocol, mostly
because pg_stat_statements has this limitation (all other
pg_stat_statements limitations listed are pretty unconvincing, IMV).
That doesn't seem like a great plan to me, but I think he has a point
about pg_stat_statements. It's about time that we fixed this -- it
isn't realistic to imagine that people are going to know to use an
array constant like "= any ('{1,2,3}')" -- even a major contributor to
Django that I talked to about this issue a couple of years ago didn't
know about that. It also isn't portable across database systems.

I wonder:

* How do other people feel about this? Personally, I've seen enough
problems of this kind in the field that "slippery slope" arguments
against this don't seem very compelling.

* How might altering the jumbling logic to make it recognize a
variable number of constants as equivalent work in practice? Perhaps
we should do something to flatten the representation based on which
powers of two the number of constants is between. There are still some
details to work out there, but that's my first idea. That seems like a
good compromise between the current behavior, and completely
disregarding the number of constants.

--
Peter Geoghegan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2015-11-24 08:02:55 Re: [PROPOSAL] VACUUM Progress Checker.
Previous Message Magnus Hagander 2015-11-24 06:42:21 Re: New email address