Re: Character expansion with ICU collations

From: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Character expansion with ICU collations
Date: 2021-06-21 13:23:38
Message-ID: 10F78B0E-3C4B-4BF8-9EF0-BEE684F4C8CC@amazon.com
Lists: pgsql-hackers

I have a proposal for how to support tailoring rules in ICU collations: The ucol_openRules() function is an alternative to the ucol_open() function that PostgreSQL calls today, but it takes the collation strength as one of its parameters, so the locale string would need to be parsed before creating the collator. After the collator is created using either ucol_openRules() or ucol_open(), the ucol_setAttribute() function may be used to set individual attributes from keyword=value pairs in the locale string, as PostgreSQL does now, except that the strength probably can't be changed after opening the collator with ucol_openRules(). So the logic in pg_locale.c would need to be reorganized a little bit, but that sounds straightforward.
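To make the reordering concrete, here is a rough sketch of the control flow, written in Python only for brevity (the real change would be C code in pg_locale.c). The helper names parse_keywords and plan_collator_open are hypothetical, the rule string is passed in separately as a simplifying assumption, and keyword case handling is ignored. It only decides which ICU entry point to call and which attributes remain to be set with ucol_setAttribute(); it does not call ICU.

```python
# Hypothetical sketch of the reorganized logic; not actual PostgreSQL code.
DEFAULT_STRENGTH = "tertiary"  # default assumed when colStrength is absent


def parse_keywords(locale):
    """Split 'base@k1=v1;k2=v2' into (base, {k1: v1, k2: v2})."""
    base, _, tail = locale.partition("@")
    pairs = {}
    for item in filter(None, tail.split(";")):
        key, _, value = item.partition("=")
        pairs[key.strip()] = value.strip()
    return base, pairs


def plan_collator_open(locale, rules=None):
    """Decide how to open the collator and which attributes to set afterwards."""
    _, keywords = parse_keywords(locale)
    if rules is None:
        # No tailoring rules: open with ucol_open() and set every keyword.
        return {"open": "ucol_open", "strength": None,
                "set_attributes": keywords}
    # Tailoring rules: the strength must be known before ucol_openRules(),
    # and colStrength is then skipped when applying the other keywords.
    strength = keywords.get("colStrength", DEFAULT_STRENGTH)
    remaining = {k: v for k, v in keywords.items() if k != "colStrength"}
    return {"open": "ucol_openRules", "strength": strength,
            "set_attributes": remaining}
```

With rules present, plan_collator_open("und@colStrength=secondary;colCaseFirst=upper", rules="&a < b") selects ucol_openRules with secondary strength and leaves only colCaseFirst for ucol_setAttribute().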

One simple solution would be to have the tailoring rules be specified as a new keyword=value pair, such as colTailoringRules=<rulestring>. Since the <rulestring> may contain single quote characters or PostgreSQL escape characters, any single quote characters or escapes would need to be escaped using PostgreSQL escape rules. If colTailoringRules is present, colStrength would also be known prior to opening the collator, or would default to tertiary, and we would keep a local flag indicating that we should not process the colStrength keyword again, if specified.
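As an illustration of the quoting issue, the following Python sketch builds a locale string carrying a rule string and quotes it as a PostgreSQL string literal. The colTailoringRules keyword is the one proposed here and does not exist today; the helper names pg_quote_literal and locale_with_rules are made up for the example. It assumes the usual PostgreSQL literal rules: single quotes are doubled, and in E'' strings backslashes are doubled as well.

```python
# Hypothetical sketch; colTailoringRules is only a proposed keyword.
def pg_quote_literal(value, escape_string=False):
    """Quote value as a PostgreSQL string literal.

    Standard strings: double each single quote.
    E'' strings: additionally double each backslash.
    """
    body = value.replace("'", "''")
    if escape_string:
        body = body.replace("\\", "\\\\")
        return "E'" + body + "'"
    return "'" + body + "'"


def locale_with_rules(base_locale, rules):
    """Append the proposed colTailoringRules keyword to a locale string."""
    sep = ";" if "@" in base_locale else "@"
    return f"{base_locale}{sep}colTailoringRules={rules}"


# An example rule string that quotes a character with single quotes.
rules = "&a < '-' < b"
locale = locale_with_rules("und@colStrength=secondary", rules)
literal = pg_quote_literal(locale)
```

Here literal holds the text 'und@colStrength=secondary;colTailoringRules=&a < ''-'' < b', which is the form the locale would take inside a SQL statement.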

Representing the TailoringRules as just another keyword=value in the locale string means that we don't need any change to the catalog to store it. It's just part of the locale specification. I think we wouldn't even need to bump the catversion.

Are there any tailoring rules, such as expansions and contractions, that we should disallow? I realize that we don't handle nondeterministic collations in LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' in a database with UTF-8 encoding and arbitrary tailoring rules that include expansions and contractions, is it still guaranteed that expr must sort BETWEEN 'a' AND ('a' || E'\uFFFF')?
