From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: planner support functions: handle GROUP BY estimates ?
Date: 2019-12-23 00:16:48
Message-ID: 20191223001648.GG30414@telsasoft.com
Lists: pgsql-hackers

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
>
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by". Currently that can only be
> done manually, by creating an expression index, and only if the function
> is IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
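
For reference, the expression-index workaround looks like this (a sketch
against a hypothetical table tbl with a plain-timestamp column ts, so that
date_trunc() is IMMUTABLE):

|postgres=# CREATE INDEX ON tbl (date_trunc('hour', ts));
|postgres=# ANALYZE tbl;

After that, pg_statistic has an ndistinct for the indexed expression, and
GROUP BY date_trunc('hour', ts) is estimated from it.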

I didn't hear back, so I tried implementing this for date_trunc(). Currently,
the planner assumes a function outputs as many groups as its input variable
has distinct values. Most invocations of our reports use date_trunc() (or
similar), so my earlier attempt to alert on rowcount misestimates was
short-lived.

The patch currently assumes the input data has 1-second granularity:
|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a; ANALYZE t;
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)

If the input timestamps actually have (say) hourly granularity, the rowcount
will be *underestimated* by 3600x: for date_trunc('day'), the patch divides
the input's distinct-value count by 86400, when the appropriate divisor for
hourly input would be 24. That's worse than the behavior in master, which
(for "day") merely overestimates by 24x.
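
To illustrate (a hypothetical session; the estimate described is what the
1-second assumption implies, not measured output):

|postgres=# CREATE TABLE th(i) AS SELECT date_trunc('hour',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 hour')a; ANALYZE th;
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM th GROUP BY 1;

There are only ~169 distinct inputs here, so dividing by 86400 clamps the
estimate to rows=1, while the true group count is 8.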

I'm trying to think of ways to address that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes when it would
make the rowcount less than (say) 2 or 100. That would still divide by 24
for days, but might skip the final /60 or /60/60. Ultimately, that's more
"fudge" than anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, timestamp_support_10pow2,
timestamp_support_10pow3, etc., each with the given corrective factor built
in, allowing more accurate rowcounts for input data whose granularity is the
given number of seconds (see the sketch after this list).

Ideally the factor would be user-specified, but I don't think that's
possible to specify in SQL; the constant has to be built into the C
function. At Telsasoft, our data mostly has 15-minute granularity (900
seconds), so we'd maybe create a "date_trunc" function in the user schema
which calls pg_catalog.date_trunc() and has support function
timestamp_support_10pow3;

There could be a "base" support function that accepts a multiplier argument,
so that any user-provided C extension would be a one-liner specifying an
arbitrary value;

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;
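
The option-2 wrapper might look like this. A sketch only: it assumes the
patch provides timestamp_support_10pow3, reuses the internal C function
behind pg_catalog.date_trunc(), and needs superuser for LANGUAGE internal
and the SUPPORT clause:

|postgres=# CREATE FUNCTION myschema.date_trunc(text, timestamptz)
|postgres-#   RETURNS timestamptz STRICT STABLE
|postgres-#   LANGUAGE internal AS 'timestamptz_trunc'
|postgres-#   SUPPORT timestamp_support_10pow3;
|postgres=# SET search_path = myschema, pg_catalog;

Unqualified calls to date_trunc() then resolve to the wrapper and pick up
the ~1000-second corrective factor.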

Justin

Attachment Content-Type Size
v1-0001-Planner-support-functions-for-GROUP-BY-f.patch text/x-diff 11.0 KB
