Character expansion with ICU collations

From: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Character expansion with ICU collations
Date: 2021-06-09 15:31:33
Message-ID: 5FF0F55B-1593-4FBC-A81A-7F3F6D1E4388@amazon.com
Lists: pgsql-hackers


Is there a way to get ‘character expansions’ with the ICU collations that are available in PostgreSQL?

Using this example on a database with UTF-8 encoding:

CREATE COLLATION CI_AS (provider = icu, locale = 'utf8@colStrength=secondary', deterministic = false);

CREATE TABLE MyTable3
(
    -- PostgreSQL identity syntax; IDENTITY(1, 1) is SQL Server syntax
    ID INT GENERATED ALWAYS AS IDENTITY,
    Comments VARCHAR(100)
);

INSERT INTO MyTable3 (Comments) VALUES ('strasse');
INSERT INTO MyTable3 (Comments) VALUES ('straße');

SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'strasse';
SELECT * FROM MyTable3 WHERE Comments COLLATE CI_AS = 'straße';

We would like to control whether each SELECT statement finds both records (because the sort key of ‘ß’ equals the sort key of ‘ss’) or just one record.

ICU supports character expansions and other tailorings, such as changing the collation order of specific characters. CREATE COLLATION doesn’t expose tailoring directives for either character expansion or reordering of specific characters (the exception is @colReorder, which reorders entire categories of characters, such as Greek vs Roman). Even so, the expectation seems to be that many <language> <country> pairs such as en_US should already cause ‘ß’ to match ‘ss’, rather than merely sorting them close together (which they do).
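One way to probe this is to compare the two literals directly under collations that differ only in strength. The following is a minimal sketch, not a confirmed answer. It assumes PostgreSQL 12 or later built with ICU and an ICU version that accepts BCP 47 locale tags. The collation names are hypothetical, 'und' is the ICU root locale, and ks-level1 / ks-level2 correspond to colStrength=primary / colStrength=secondary. Whether either comparison returns true depends on the ICU version and its collation data, so no result is claimed here.

```sql
-- Hypothetical collation names, used only for this probe.
-- Root locale, primary strength (ignores accent and case differences).
CREATE COLLATION probe_level1 (provider = icu, locale = 'und-u-ks-level1', deterministic = false);

-- Root locale, secondary strength (ignores case differences only).
CREATE COLLATION probe_level2 (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

-- COLLATE binds more tightly than "=", so the explicit collation on the
-- right-hand literal determines the collation used for each comparison.
SELECT 'strasse' = 'straße' COLLATE probe_level1 AS equal_at_primary,
       'strasse' = 'straße' COLLATE probe_level2 AS equal_at_secondary;
```

Comparing literals this way separates the collation behaviour from the table and column definitions above, which makes it easier to see at which strength, if any, the expansion takes effect.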

If PostgreSQL supports character expansion with ICU collations, can someone provide an example where 'strasse' = 'straße'?
