Re: strcmp() tie-breaker for identical ICU-collated strings

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strcmp() tie-breaker for identical ICU-collated strings
Date: 2017-06-09 16:17:03
Message-ID: CA+TgmoaRTar_j6SjP6c-ZbMdL6X0U52yJgn-=yEyW1qc17BAkA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 9, 2017 at 11:46 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> writes:
>> On 6/9/17 11:12, Tom Lane wrote:
>>> https://www.postgresql.org/message-id/27064.1134753128@sss.pgh.pa.us
>
>> Good to know. That just says that if we were to go with the strcoll()
>> result only, things would work correctly.
>
> There's still the hashing problem.

Tom, that mailing list discussion is very illuminating. Thanks for
digging it up.

Regarding the question of hashing, one way to support that would be if
we had some sort of canonicalization function. IOW, suppose there
were a collation API call distill() which had the property that
strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
under that collation. Then, you could define your hash function as
hash_any(distill(X)). Alternatively, if the collation library
provided its own hashing function, that would be fine too, and
probably faster.
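
To make the canonicalization idea concrete, here is a minimal sketch in C. Everything in it is an assumption made for illustration: distill() is the hypothetical call described above, not an existing collation API; the "collation" is a toy one that treats ASCII strings as equal when they differ only in case; and toy_hash_bytes() is a simple FNV-1a stand-in for hash_any(). The only point is that hashing the distilled form gives equal hashes for strings the collation considers equal.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical distill(): write a canonical form of src into dst.
 * The toy collation here is ASCII case-insensitive, so the canonical
 * form is simply the lower-cased string.  Input longer than the
 * buffer is truncated, which a real implementation could not do
 * without breaking the "iff" property.
 */
static void
distill(const char *src, char *dst, size_t dstlen)
{
    size_t      i;

    assert(dstlen > 0);
    for (i = 0; src[i] != '\0' && i + 1 < dstlen; i++)
        dst[i] = (char) tolower((unsigned char) src[i]);
    dst[i] = '\0';
}

/* Stand-in for hash_any(): 32-bit FNV-1a over a byte buffer. */
static uint32_t
toy_hash_bytes(const char *buf, size_t len)
{
    uint32_t    h = 2166136261u;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        h ^= (unsigned char) buf[i];
        h *= 16777619u;
    }
    return h;
}

/* Hash defined as hash(distill(X)), as suggested above. */
static uint32_t
collation_hash(const char *s)
{
    char        canon[256];

    distill(s, canon, sizeof(canon));
    return toy_hash_bytes(canon, strlen(canon));
}
```

With this toy collation, collation_hash("Foo") and collation_hash("fOO") are equal because both strings distill to "foo". A real collation would need a much more involved distill(), and whether a given collation library offers something suitable is a separate question.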

On the other hand, is there any rule that says we have to support
hashing? Certainly, if we defined a new datatype collated_text, it
could have a btree opfamily and no hash opfamily. It's trickier with
only one datatype, but possibly we could come up with a way for an
opfamily to be consulted about whether it is available for a given
choice of collation. I'm not exactly sure what is possible or
desirable, but I would not be too surprised to hear complaints about
the observed behavior differing from the "pure" ICU behavior because
of the tiebreak, and at least some users might even find it worth
giving up hashing in order to get the exact sort order they need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
