Re: ICU integration

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: ICU integration
Date:	2016-09-25 05:16:39
Message-ID:	CAEepm=30SQpEUjau=dScuNeVZaK2kJ6QQDCHF75u5W=Cz=3Scw@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined). <captain-obvious>This needs to
>> use ucol_getSortKey instead when appropriate.</> It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> are unspecified if it runs out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.

Nice! The other advantage of ucol_nextSortKeyPart is that you don't have
to convert the whole string to UChar (UTF-16) first, as I think you would
need to with ucol_getSortKey, because the UCharIterator mechanism can read
directly from a UTF-8 string. I see in the documentation that
ucol_nextSortKeyPart and ucol_getSortKey don't produce compatible output,
and this caveat may be related to whether sort key compression is used. I
don't understand what sort of compression is involved, but out of curiosity
I asked ICU to spit out some sort keys from ucol_nextSortKeyPart so I could
see their size. As you say, we can ask it to stop at 4 or 8 bytes, which is
very convenient for our purposes, but here I asked for more to get the full
output so I could see where the primary weight part ends. The primary
weight took one byte for each of the Latin letters I tried and two for each
of the Japanese characters I tried (except 一, which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a

I wonder how it manages to deal with fr_CA's reversed secondary weighting
rule, which requires you to consider diacritics in reverse order --
apparently abandoned in France but still used in Canada -- using a
fixed-size space for state between calls.

--
Thomas Munro
http://www.enterprisedb.com

In response to

Re: ICU integration at 2016-09-24 10:13:31 from Peter Geoghegan
