Re: planner support functions: handle GROUP BY estimates ?

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: planner support functions: handle GROUP BY estimates ?
Date: 2020-01-14 20:53:49
Message-ID: 20200114205349.m32jgrf77sbxsziy@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 14, 2020 at 03:12:21PM -0500, Tom Lane wrote:
>Justin Pryzby <pryzby(at)telsasoft(dot)com> writes:
>> On Sun, Dec 22, 2019 at 06:16:48PM -0600, Justin Pryzby wrote:
>>> On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
>>>> Tom implemented "Planner support functions":
>>>> https://www.postgresql.org/docs/12/xfunc-optimization.html
>>>> I wondered whether there was any consideration to extend that to allow
>>>> providing improved estimates of "group by". That currently requires manually
>>>> by creating an expression index, if the function is IMMUTABLE (which is not
>>>> true for eg. date_trunc of timestamptz).
>
>>> I didn't hear back so tried implementing this for date_trunc(). Currently, the
>>> ...
>>> If the input timestamps have (say) hourly granularity, rowcount will be
>>> *underestimated* by 3600x, which is worse than the behavior in master of
>>> overestimating by (for "day") 24x.
>
>While I don't have any objection in principle to extending the set of
>things planner support functions can do, it doesn't seem like the idea is
>giving you all that much traction for this problem. There isn't that much
>knowledge that's specific to date_trunc in this, and instead you've got a
>bunch of generic problems (that would have to be solved again in every
>other function's planner support).
>
>Another issue is that it seems like this doesn't compose nicely ---
>if the GROUP BY expression is "f(g(x))", how do f's support function
>and g's support function interact?
>
>The direction that I've been wanting to go in for this kind of problem
>is to allow CREATE STATISTICS on an expression, ie if you were concerned
>about the estimation accuracy for GROUP BY or anything else, you could do
>something like
>
>CREATE STATISTICS foo ON date_trunc('day', mod_time) FROM my_table;
>
>This would have the effect of cueing ANALYZE to gather stats on the
>value of that expression, which the planner could then use, very much
>as if you'd created an index on the expression. The advantages of
>doing this rather than making an index are
>
>(1) you don't have to pay the maintenance costs for an index,
>
>(2) we don't have to restrict it to immutable expressions. (Volatile
>expressions would have to be disallowed, if only because of fear of
>side-effects; but I think we could allow stable expressions just fine.
>Worst case problem is that the stats are stale, but so what?)
>
>With a solution like this, we don't have to solve any of the difficult
>problems of how the pieces of the expression interact with each other
>or with the statistics of the underlying column(s). We just use the
>stats if available, and the estimate will be as good as it'd be for
>a plain column reference.
>
>I'm not sure how much new infrastructure would have to be built
>for this. We designed the CREATE STATISTICS syntax to support
>this (partly at my insistence IIRC) but I do not think any of the
>existing plumbing is ready for it. I don't think it'd be very
>hard to plug this into ANALYZE or the planner, but there might be
>quite some work to be done on the catalog infrastructure, pg_dump,
>etc.
>
>cc'ing Tomas in case he has any thoughts about it.
>

Well, I certainly do thoughts about this - it's pretty much exactly what
I proposed yesterday in this thread:

https://www.postgresql.org/message-id/flat/20200113230008(dot)g67iyk4cs3xbnjju(at)development

The third part of that patch series is exactly about supporting extended
statistics on expressions, about the way you described here. The current
status of the WIP patch is that grammar + ANALYZE mostly works, but
there is no support in the planner. It's obviously still very hackish.

The main thing I'm not sure about is how to represent this in catalogs,
whether to have two fields (like for indexes) or maybe a single list of
expressions.

I'm also wondering if we could/should 100% rely on extended statistics,
because those are really meant to track correlations between columns,
which means we currently require at least two attributes in CREATE
STATISTICS and so on. So maybe what we want is collecting "regular"
per-column stats just like we do for indexes, but without the index
maintenance overhead?

The advantage would be we'd get exactly the same stats as for indexes,
and we could use them in the same places out of the box. While with
extended stats we'll have to tweak those places.

Now, the trouble is we can't store stuff in pg_statistic without having
a relation (i.e. table / index / ...) but maybe we could invent a new
relation type for this purpose. Of course, it'd require some catalog
work to represent this ...

Ultimately I think we'd want both things, it's not one or the other.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-01-14 21:09:55 Re: proposal: type info support functions for functions that use "any" type
Previous Message Stephen Frost 2020-01-14 20:45:17 Re: 12.1 not useable: clientlib fails after a dozen queries (GSSAPI ?)