Re: Super PathKeys (Allowing sort order through precision loss functions)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Super PathKeys (Allowing sort order through precision loss functions)
Date: 2018-10-31 13:18:35
Message-ID: 71c7bbe1-49d2-36c9-a93c-a3a39a977068@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/31/2018 04:32 AM, David Rowley wrote:
> On 31 October 2018 at 14:23, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> The other thing likely affecting this is locale / collation. Probably
>> not for date_trunc, but certainly for things like substr()/trim(),
>> mentioned by Simon upthread.
>>
>> In some languages the rules are pretty complex, and there's no chance
>> it'll survive arbitrary substr() applied to the string. For example, in
>> Czech we mostly sort character-by-character, but "ch" is an exception
>> sorted in between "h" and "i".
>>
>> So essentially "hhhh <= hchh <= hihh". Obviously, substr($1,0,3) cuts
>> the "ch" in half, changing the ordering:
>>
>> create table x (y text collate "cs_CZ");
>> insert into x values ('hhhh'), ('hchh'), ('hihh');
>>
>> test=# select y from x order by 1;
>> y
>> ------
>> hhhh
>> hchh
>> hihh
>> (3 rows)
>>
>> test=# select substr(y,0,3) from x order by 1;
>> substr
>> --------
>> hc
>> hh
>> hi
>> (3 rows)
>>
>> I'm preeeeeeetty sure other languages have even funnier rules ...
>
> That's pretty interesting, but as mentioned in my initial email...
> More careful thought would be needed for other numerical types and
> text types, I imagine, though I've not thought much about that.
>
> I don't really think trim() or substr() would ever work for the reason
> that they don't always operate on a prefix of the string. What you've
> mentioned seems to rule out LEFT().
>

Sure, but it wasn't very clear to me what the "more careful thought"
would mean.

I think a direct consequence of Tom's point about opclasses is that
storing this information in pg_proc is not going to fly - you would need
a separate catalog for that, to allow mapping the function to multiple
opclasses (possibly with different features?). But OK, that's doable.

But what if you also need to do that for collations? Firstly, it would
it add another degree of freedom, essentially making it (proc, opclass,
collation, ... metadata ...) so there would be many such combinations. I
can't really imagine defining that manually, but maybe it can be
simplified. But more importantly - the list of collations is kinda
dynamic, and AFAIK the collation rules may change depending on the
glibc/icu versions. So, that seems pretty tricky. It's certainly not a
just a SMOP.

>> I'm mildly suspicious of this data set, because it's essentially
>> perfectly sorted/deterministic, and the Sort node won't have to do
>> anything. So perhaps it just works by chance.
>>
>> Consider this instead:
>>
>> create table dt (ts timestamp, x text);
>>
>> insert into dt select * from (select d, (case when random() < 0.5 then
>> 'month' else 'hour' end) from generate_series('2018-01-01'::timestamp,
>> '2018-12-31'::timestamp, '1 hour'::interval) d) as foo order by random();
>>
>
> [...]
>
>> 2018-01-01 00:00:00
>> 2018-01-01 00:00:00
>> 2018-01-01 00:00:00
>> 2018-01-02 13:00:00
>> ... kaboooooom! ...
>
> Yeah. This is an issue. Thanks for the test case. However, I
> acknowledged that in my reply to Tom. I did overlook it, which was
> completely down to lack of imagination on my part. I'd never
> considered using date_trunc() without a const 1st argument before. It
> seems simple enough to disable the optimisation in that case. I've
> attached a patch which does that. Maybe that'll help us look beyond
> this and focus on any other reasons why this is not possible.
>

Ah, sorry - I've missed this bit in your response. In my defense, it's
been quite late over here, and my caffeine level got a tad too low.

Anyway, the question is how strong would the requirement need to be, and
how would that affect applicability of this optimization in other cases.
I agree it's probably not an issue for date_trunc() - I don't recall
ever using it with non-constant first parameter. I wonder if there are
other functions where that's not the case - although, that's not really
an argument against this optimization.

> It's also true that this diminishes the usefulness of the idea, but
> part of the reason I've posting the idea so early after having thought
> about it is precisely to see if this is going to float or sink.

Sure. And pgsql-hackers are very good in sinking ideas ;-)

> Maybe we'll decide the scope of usefulness is so small that it's not
> worth it, or that each function has such different requirements that
> we can't reasonably make it work by adding a few columns to pg_proc.
> I'm personally more interested in the cases that can work. I
> understand there is no shortage of cases where it can't.
>

I think it can't be made just by adding a couple of columns to pg_proc,
as explained above. A separate catalog mapping procs to opclasses (and
maybe collations) may be needed. For cases where it depends on the
actual parameter values (like substr/trim/ltrim) an extra function might
be needed (something like the selectivity functions for data types).

> Giving that we require const arguments away from the orderkey, perhaps
> it could be made to work for simple arithmetic OpExprs. I'm not sure
> if the use cases are actually there for that sort of thing and I've
> seen WHERE indexcol+0 = <n> used many times to disable the use of
> indexes, so making pathkeys see through those might be more annoying
> than useful... But it's a thought...
>

Well, that really depends on the definition of the "+" operator, which
is pretty much just a function. So I don't see why would that simplify
the situation?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2018-10-31 13:30:56 Re: WIP Patch: Add a function that returns binary JSONB as a bytea
Previous Message Amit Kapila 2018-10-31 13:09:07 Re: WIP: Avoid creation of the free space map for small tables