From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: planner support functions: handle GROUP BY estimates ?
Date: 2019-12-23 00:16:48
Message-ID: 20191223001648.GG30414@telsasoft.com
Lists: pgsql-hackers

On Tue, Nov 19, 2019 at 01:34:21PM -0600, Justin Pryzby wrote:
> Tom implemented "Planner support functions":
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a391ff3c3d418e404a2c6e4ff0865a107752827b
> https://www.postgresql.org/docs/12/xfunc-optimization.html
>
> I wondered whether there was any consideration to extend that to allow
> providing improved estimates of "group by". Currently that can only be
> done manually, by creating an expression index, and only if the function
> is IMMUTABLE (which is not true for e.g. date_trunc of timestamptz).
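
For reference, the expression-index workaround looks like this (a sketch
against a hypothetical table tbl with a plain-timestamp column ts, so that
date_trunc() is IMMUTABLE):

|postgres=# CREATE INDEX ON tbl (date_trunc('hour', ts));
|postgres=# ANALYZE tbl;

After that, pg_statistic has an ndistinct for the indexed expression, and
GROUP BY date_trunc('hour', ts) is estimated from it.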

I didn't hear back, so I tried implementing this for date_trunc(). Currently,
the planner assumes a function outputs as many groups as its input variable
has distinct values. Most invocations of our reports use date_trunc() (or
similar), so my earlier attempt to alert on rowcount misestimates was
short-lived.

The patch currently assumes the input data has 1-second granularity:
|postgres=# CREATE TABLE t(i) AS SELECT date_trunc('second',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 seconds')a; ANALYZE t;
|postgres=# explain analyze SELECT date_trunc('hour',i) i FROM t GROUP BY 1;
| Group (cost=9021.85..9042.13 rows=169 width=8) (actual time=1365.934..1366.453 rows=169 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('minute',i) i FROM t GROUP BY 1;
| Finalize HashAggregate (cost=10172.79..10298.81 rows=10081 width=8) (actual time=1406.057..1413.413 rows=10081 loops=1)
|
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM t GROUP BY 1;
| Group (cost=9013.71..9014.67 rows=8 width=8) (actual time=1582.998..1583.030 rows=8 loops=1)

If the input timestamps actually have (say) hourly granularity, the rowcount
will be *underestimated* by 3600x: for date_trunc('day'), the patch divides
the input's distinct-value count by 86400, when the appropriate divisor for
hourly input would be 24. That's worse than the behavior in master, which
(for "day") merely overestimates by 24x.
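
To illustrate (a hypothetical session; the estimate described is what the
1-second assumption implies, not measured output):

|postgres=# CREATE TABLE th(i) AS SELECT date_trunc('hour',a)a FROM generate_series(now(), now()+'7 day'::interval, '1 hour')a; ANALYZE th;
|postgres=# explain analyze SELECT date_trunc('day',i) i FROM th GROUP BY 1;

There are only ~169 distinct inputs here, so dividing by 86400 clamps the
estimate to rows=1, while the true group count is 8.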

I'm trying to think of ways to address that:

0) Add a fudge factor of 4x or maybe 30x;

1) Avoid applying a corrective factor for seconds or minutes when it would
make the rowcount less than (say) 2 or 100. That would still divide by 24
for days, but might skip the final /60 or /60/60. Ultimately, that's more
"fudge" than anything else;

2) Leave pg_catalog.date_trunc() alone, but provide "template" support
functions like timestamp_support_10pow1, timestamp_support_10pow2,
timestamp_support_10pow3, etc., each with the given corrective factor built
in, allowing more accurate rowcounts for input data whose granularity is the
given number of seconds (see the sketch after this list).

Ideally the factor would be user-specified, but I don't think that's
possible to specify in SQL; the constant has to be built into the C
function. At Telsasoft, our data mostly has 15-minute granularity (900
seconds), so we'd maybe create a "date_trunc" function in the user schema
which calls pg_catalog.date_trunc() and has support function
timestamp_support_10pow3;

There could be a "base" support function that accepts a multiplier argument,
so that any user-provided C extension would be a one-liner specifying an
arbitrary value;

3) Maybe there are better functions than date_trunc() to address;

4) Leave it as a patch in the archives for people to borrow from;
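
The option-2 wrapper might look like this. A sketch only: it assumes the
patch provides timestamp_support_10pow3, reuses the internal C function
behind pg_catalog.date_trunc(), and needs superuser for LANGUAGE internal
and the SUPPORT clause:

|postgres=# CREATE FUNCTION myschema.date_trunc(text, timestamptz)
|postgres-#   RETURNS timestamptz STRICT STABLE
|postgres-#   LANGUAGE internal AS 'timestamptz_trunc'
|postgres-#   SUPPORT timestamp_support_10pow3;
|postgres=# SET search_path = myschema, pg_catalog;

Unqualified calls to date_trunc() then resolve to the wrapper and pick up
the ~1000-second corrective factor.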

Justin

Attachment Content-Type Size
v1-0001-Planner-support-functions-for-GROUP-BY-f.patch text/x-diff 11.0 KB
