Super PathKeys (Allowing sort order through precision loss functions)

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Super PathKeys (Allowing sort order through precision loss functions)
Date: 2018-10-30 07:58:15
Message-ID: CAKJS1f_pGLiGruPxU44dUMKiYouHjrODyTy7SxFJ5hVtoUYf8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Hackers,

I've started working on something I've ended up calling "Super
PathKeys". The idea here is to increase the likelihood of a Path with
PathKeys being used for a purpose that requires a less strict sort
order due to ordering being required from the return value of some
precision loss function.

This is best explained by example, so here's the actual and expected
output of some tests that the patch adds:

create table tstbl (ts timestamp, a int);
create index on tstbl (ts, a);
set enable_sort = 0;
explain (costs off) select date_trunc('year', ts), a, count(*) from
tstbl group by 1,2 order by 1,2;
QUERY PLAN
-----------------------------------------------------
GroupAggregate
Group Key: date_trunc('year'::text, ts), a
-> Index Only Scan using tstbl_ts_a_idx on tstbl
(3 rows)

-- Test a more complex case where the superkey can be matched to
multiple pathkeys
explain (costs off) select date_trunc('year', ts), date_trunc('month',
ts), a, count(*) from tstbl group by 1,2,3 order by 1,2,3;
QUERY PLAN
-----------------------------------------------------------------------------
GroupAggregate
Group Key: date_trunc('year'::text, ts), date_trunc('month'::text, ts), a
-> Index Only Scan using tstbl_ts_a_idx on tstbl
(3 rows)

You can see here that the planner managed to make use of the index on
(ts, a) to provide sorted input to the GROUP BY date_trunc('year',
ts). This saves a possible slow HashAggregate plan, which would be
especially bad if there are a large number of groups. It'll also be
very useful if only a subset of all groups are required and a LIMIT
clause is specified. It should also be useful for reports run on large
timeseries data tables, especially partitioned ones when teamed up
with the "Ordered Partitioned Table Scans" in [1].

Benchmarks don't seem so relevant here as it would be simple enough to
craft them to show anything between 0% improvement up to some number
that's difficult to pronounce. I think it's better to discuss how I
think this is possible.

Implementation:

I've modified pg_proc to add a new column which optionally can be set
to the argument number of the input parameter that the return value
will be calculated from. In the patch, I've only set this so far for
the date_trunc() function and a handful of casting functions. When
building a PathKey the expression used in the key is passed into a
function which tries to extract the superkey expr. If a non-NULL value
is returned then an additional PathKey is generated with that expr and
that's set in a new field named pk_superkey in the original PathKey.
If the superkey expr itself contains another superkey expr then the
superkey PathKey just itself has a link to another superkey, and so
on.

pathkeys_contained_in() has been modified so that when a non-matching
PathKey is found it tries again by looking at the pk_superkey (if set)
and keeps walking up the chain of PathKeys until it reaches a PathKey
without a pk_superkey, or it finds a match. The function also must
check if the next key1 matches the same key2. This allows the 2nd
example above to work since both date_trunc() calls match the "ts" key
in the index scan. If the 2nd attempt does not match then we just
backup one and move to the next key2 and repeat.

I think this functionality could also be used to allow superkeys to be
derived through casts. We'd need to be careful to only tag casting
functions where we're certain the sort order of the input and output
types match. In the patch, I've just tagged a couple of casting
functions between timestamp and date, but I imagine it should also be
possible for casts between all the INT types, probably in either
direction. More careful thought would be needed for other numerical
types and text types, I imagine, though I've not thought much about
that.

Syntax:

I've not really gone very far to think about exposing this to userland
yet. I've only thought as far as

CREATE FUNCTION name (ORDER KEY <p1name> <type>) RETURNS ...

with some checks done in CreateFunction() to ensure there's not more
than 1 ORDER KEY, and one is only specified for non-void functions.
"KEY" is unreserved but so is "BY", so I don't forsee any grammar
issues with the above, though I'm no expert.

Status of the patch:

The patch is very much still a proof of concept. I'm abusing some
parser level functions to find the correct sortop for the super
PathKey. There also might be some giant holes in the entire concept as
I only dreamed the entire thing during a weekend trip while about 1.5
mountains out of range of the internet. I only started experimenting
with the idea yesterday.

The reason I'm posting now is:

1. Transparency. I'm working on this.
2. Someone might point out a good reason why this can't be done.
3. Some testing or fuzz testing might reveal some bugs in the patch.

Please find attached the proof of concept patch.

Comments (good or bad) welcome. Reviews and testing is also welcome.

I'm considering adding it to the November 'fest. Likely I'll decided
that based on any feedback received before then.

The "superkey" name is borrowed from OOP. Think "superclass". If it
turns out not to be a dud, we can call it something else.

[1] https://commitfest.postgresql.org/20/1850/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
v1-0001-Allow-Pathkeys-to-derive-their-order-from-a-paren.patch application/octet-stream 21.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2018-10-30 08:04:56 Re: Installation instructions update (pg_ctl)
Previous Message Laurenz Albe 2018-10-30 07:42:41 Re: Getting fancy errors when accessing information_schema on 10.5