Re: Super PathKeys (Allowing sort order through precision loss functions)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Super PathKeys (Allowing sort order through precision loss functions)
Date: 2018-10-31 01:23:16
Message-ID: 6f2aae89-224c-129f-3899-9f353fb655ac@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 10/30/2018 11:41 PM, David Rowley wrote:
> On 31 October 2018 at 08:52, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
>>> I've started working on something I've ended up calling "Super
>>> PathKeys". The idea here is to increase the likelihood of a Path with
>>> PathKeys being used for a purpose that requires a less strict sort
>>> order due to ordering being required from the return value of some
>>> precision loss function.
>>
>> I'm a little confused by the terminology here, or why this has anything
>> at all to do with a new sort of pathkey. It seems like the idea you're
>> driving at is to be able to mark a function as being order-preserving,
>> in the sense that if one input is known sorted then the output will
>> also be sorted (by the same or a related opclass). You probably need
>> some additional constraints, like any other inputs being constants,
>> before that really works. But given that, I don't see why you need a
>> new kind of pathkey: you just have a new way to conclude that some
>> path is sorted by the pathkey you already want.
>
> Thanks for chipping in on this.
>
> The additional pathkeys are not required to make it work, they're just
> required to make it work efficiently. The fact that we went to the
> trouble of making pathkeys canonical so we can perform pointer
> comparison rather than using equals() says to me that I'd better not
> do anything to slow this down too much. Doing it without the superkey
> idea seems to require quite a bit of analysis during
> pathkeys_contained_in() as we'd need to check for superkeys then,
> instead of when we're building the pathkey in the first place. As
> for the code that I did add to pathkeys_contained_in(), I doubt it's
> measurably slower for the normal case.
>
> The point about the other inputs needing to be Const I did miss. That
> will also be a requirement. I just failed to consider that date_trunc()
> could be used with a variable 1st arg.
>
>> Maybe if I read the patch it'd be clearer why you want to describe it
>> that way, but I'm too lazy to do that right now. One thing I would
>> say though is that a pg_proc column identifying the interesting input
>> parameter is insufficient; you'd need to *explicitly* say which opclass(es)
>> the sorting behavior guarantee applies for.
>>

The other thing likely affecting this is locale / collation. Probably
not for date_trunc, but certainly for things like substr()/trim(),
mentioned by Simon upthread.

In some languages the collation rules are pretty complex, and there's no
chance the sort order will survive an arbitrary substr() applied to the
string. For example, in
Czech we mostly sort character-by-character, but "ch" is an exception
sorted in between "h" and "i".

So essentially "hhhh <= hchh <= hihh". Obviously, substr($1,0,3) cuts
the "ch" in half, changing the ordering:

create table x (y text collate "cs_CZ");
insert into x values ('hhhh'), ('hchh'), ('hihh');

test=# select y from x order by 1;
y
------
hhhh
hchh
hihh
(3 rows)

test=# select substr(y,0,3) from x order by 1;
substr
--------
hc
hh
hi
(3 rows)

I'm preeeeeeetty sure other languages have even funnier rules ...
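To make this concrete, here is a toy Python model of just the digraph
rule (a hand-rolled sort key where "ch" ranks between "h" and "i"; a real
implementation would of course go through ICU or the cs_CZ locale). The
full strings sort as above, while their 2-character prefixes sort
differently even under the same rules:

```python
# Toy model of the Czech "ch" digraph: it collates as a single letter
# between "h" and "i". Not a real collation -- just enough to show why
# substr() breaks the ordering.

def cs_tokens(s):
    """Split a string into collation elements, treating 'ch' as one."""
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] == 'ch':
            out.append('ch')
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out

def cs_key(s):
    # Rank 'ch' just after 'h' (and therefore before 'i').
    return [(ord('h'), 1) if t == 'ch' else (ord(t), 0)
            for t in cs_tokens(s)]

words = ['hhhh', 'hihh', 'hchh']
print(sorted(words, key=cs_key))   # ['hhhh', 'hchh', 'hihh']

# Take 2-character prefixes of the sorted list: 'hc' now contains half
# of a digraph and sorts *before* 'hh', so the order is not preserved.
prefixes = [w[:2] for w in sorted(words, key=cs_key)]
print(prefixes == sorted(prefixes, key=cs_key))   # False
```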

>>> -- Test a more complex case where the superkey can be matched to
>>> multiple pathkeys
>>> explain (costs off) select date_trunc('year', ts), date_trunc('month',
>>> ts), a, count(*) from tstbl group by 1,2,3 order by 1,2,3;
>>> QUERY PLAN
>>> -----------------------------------------------------------------------------
>>> GroupAggregate
>>> Group Key: date_trunc('year'::text, ts), date_trunc('month'::text, ts), a
>>> -> Index Only Scan using tstbl_ts_a_idx on tstbl
>>> (3 rows)
>>
>> [ squint... ] Does that really work? If so, how? It would need a whole
>> lot more knowledge about the behavior of date_trunc than I think could
>> possibly be reasonable to encode in a general mechanism.
>
> I'm not entirely certain we can always consume multiple matching keys
> or if it has to be one for one. However, I get an empty diff from each
> of the following two query pairs.
>
> select date_trunc('month'::text,date),date_trunc('year', date) from dt
> order by dt;
> select date_trunc('month'::text,date),date_trunc('year', date) from dt
> order by 1,2;
>
> and
>
> select date_trunc('year'::text,date),date_trunc('month', date) from dt
> order by dt;
> select date_trunc('year'::text,date),date_trunc('month', date) from dt
> order by 1,2;
>
> with setup:
>
> create table dt (date date);
> insert into dt select d from generate_series('2018-01-01',
> '2018-12-31', '1 day'::interval) d;
>

I'm mildly suspicious of this data set, because it's essentially
perfectly sorted/deterministic, and the Sort node won't have to do
anything. So perhaps it just works by chance.

Consider this instead:

create table dt (ts timestamp, x text);

insert into dt select * from (select d, (case when random() < 0.5 then
'month' else 'hour' end) from generate_series('2018-01-01'::timestamp,
'2018-12-31'::timestamp, '1 hour'::interval) d) as foo order by random();

test=# explain select date_trunc(x, ts) from dt order by 1;
QUERY PLAN
-------------------------------------------------------------
Sort (cost=729.18..751.02 rows=8737 width=8)
Sort Key: (date_trunc(x, ts))
-> Seq Scan on dt (cost=0.00..157.21 rows=8737 width=8)
(3 rows)

test=# select date_trunc(x, ts) from dt order by 1;

date_trunc
---------------------
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
... seems ok ...

test=# explain select date_trunc(x, ts) from dt order by ts;
QUERY PLAN
--------------------------------------------------------------
Sort (cost=729.18..751.02 rows=8737 width=16)
Sort Key: ts
-> Seq Scan on dt (cost=0.00..157.21 rows=8737 width=16)
(3 rows)

test=# select date_trunc(x, ts) from dt order by ts;

date_trunc
---------------------
2018-01-01 00:00:00
2018-01-01 01:00:00
2018-01-01 02:00:00
2018-01-01 00:00:00
2018-01-01 04:00:00
2018-01-01 05:00:00
2018-01-01 06:00:00
2018-01-01 07:00:00
2018-01-01 08:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 13:00:00
2018-01-01 14:00:00
2018-01-01 00:00:00
2018-01-01 16:00:00
2018-01-01 17:00:00
2018-01-01 18:00:00
2018-01-01 19:00:00
2018-01-01 20:00:00
2018-01-01 21:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-02 00:00:00
2018-01-01 00:00:00
2018-01-02 02:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-02 05:00:00
2018-01-02 06:00:00
2018-01-02 07:00:00
2018-01-02 08:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-01 00:00:00
2018-01-02 13:00:00
... kaboooooom! ...

I suspect this is caused by the date_trunc() precision parameter being
variable, which is something Tom already mentioned as a potential issue.
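A minimal Python sketch of that failure mode (the trunc() helper below
is my stand-in for date_trunc(), not the real thing): with a constant
precision the truncation is monotone in ts, so input order carries over
to the output; with a per-row precision it clearly is not:

```python
from datetime import datetime, timedelta

def trunc(precision, ts):
    """Rough stand-in for date_trunc(): zero out fields below 'precision'."""
    if precision == 'month':
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if precision == 'hour':
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(precision)

start = datetime(2018, 1, 1)
# Rows sorted by ts, each carrying its own precision value.
rows = [(start + timedelta(hours=i),
         'month' if i % 2 else 'hour') for i in range(48)]

# Constant precision: input order (by ts) implies output order.
const = [trunc('hour', ts) for ts, _ in rows]
print(const == sorted(const))   # True

# Variable precision: the output is no longer sorted although ts is.
var = [trunc(p, ts) for ts, p in rows]
print(var == sorted(var))       # False
```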

The already-mentioned substr()/trim() are another example of this issue.
Not only must the other parameters be constant (or at least use the same
value for all rows), the optimization may also work for some parameter
values and fail for others.

For example substr($1,0,$2) might preserve the order, but
substr($1,10,$2) likely does not. Similarly for trim().
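To illustrate that distinction with plain byte-wise ("C" collation)
ordering in Python (my example, not from the thread): taking a leading
prefix preserves lexicographic order, while a substring starting at a
later offset does not:

```python
# Byte-wise lexicographic ordering, as in the "C" collation.
words = sorted(['abc', 'abd', 'azz', 'baa', 'bzz'])

# A prefix (substr from position 1) is order-preserving: if a <= b,
# then a[:n] <= b[:n] always holds.
pre = [w[:2] for w in words]
print(pre == sorted(pre))   # True

# A substring starting later in the string is not order-preserving.
mid = [w[1:3] for w in words]
print(mid == sorted(mid))   # False
```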

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
