patch suggestion: Fix citext_utf8 test's "Turkish I" with ICU collation provider

From: Anton Voloshin <a(dot)voloshin(at)postgrespro(dot)ru>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: r(dot)zharkov(at)postgrespro(dot)ru
Subject: patch suggestion: Fix citext_utf8 test's "Turkish I" with ICU collation provider
Date: 2022-10-21 17:23:33
Message-ID: 52104a17-7a23-c315-1a97-06c691af748c@postgrespro.ru
Lists: pgsql-hackers

Hello, hackers.

In current master, as well as in REL_15_STABLE, installcheck in
contrib/citext fails in most locales, if we use ICU as a locale provider:

$ rm -fr data; initdb --locale-provider icu --icu-locale en-US -D data \
    && pg_ctl -D data -l logfile start \
    && make -C contrib/citext installcheck; \
    pg_ctl -D data stop; cat contrib/citext/regression.diffs
...
test citext ... ok 457 ms
test citext_utf8 ... FAILED 21 ms
...
diff -u /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out
--- /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out 2022-07-14 17:45:31.747259743 +0300
+++ /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out 2022-10-21 19:43:21.146044062 +0300
@@ -54,7 +54,7 @@
 SELECT 'i'::citext = 'İ'::citext AS t;
  t
 ---
- t
+ f
 (1 row)

The reason is that in ICU, lowercasing the Unicode character "İ" (U+0130
LATIN CAPITAL LETTER I WITH DOT ABOVE) can give two valid results:
- "i", i.e. U+0069 LATIN SMALL LETTER I, in the "tr" and "az" locales;
- "i̇", i.e. U+0069 LATIN SMALL LETTER I followed by U+0307 COMBINING
DOT ABOVE, in all other locales I've tried (including "en-US", "de",
"ru", etc.).
The test as currently written accepts only the plain Latin "i", which
might be what glibc returns, but is not what ICU returns. Verified on
ICU 70.1, but I've seen this on a few other ICU versions as well, so I
think this is probably an ICU feature, not a bug(?).
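
The two-code-point result can be reproduced outside PostgreSQL. The
sketch below is an illustration only: it uses Python's built-in
str.lower(), which is not ICU but applies the locale-independent Unicode
full case mapping, and so yields the same sequence described above for
the non-Turkic locales. The helper name describe() is made up for this
example.

```python
import unicodedata


def describe(s):
    """Return one 'U+XXXX NAME' string per code point in s."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]


# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
capital_i_dot = "\u0130"

# Locale-independent full case mapping (Python's str.lower(), not ICU).
lowered = capital_i_dot.lower()

print(describe(capital_i_dot))
print(describe(lowered))   # two code points: U+0069 followed by U+0307
print(lowered == "i")      # False: the result is not the plain Latin "i"
```

A comparison that expects the plain "i" therefore only holds when the
lowercasing follows the Turkic ("tr", "az") rule.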

Since we probably want installcheck in contrib/citext to pass on
databases with various locales, including reasonable ICU-based ones,
I suggest fixing this test by accepting either output as valid.

I can see two ways of doing that:
1. change SQL in the test to use "IN" instead of "=";
2. add an alternative output.

I think in this case "IN" is better, because it lets a single comment
address both possible outputs and avoids unnecessary duplication.
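
To illustrate why a membership check passes under both behaviours, here
is a rough model in Python. It is not the attached patch and not citext
itself. It assumes, as the citext documentation describes, that
comparison amounts to comparing lowercased values. All names
(lower_default, lower_turkic, ci_equal, ci_in) are hypothetical, and
lower_turkic is a deliberately simplified stand-in for the "tr"/"az"
rule.

```python
def lower_default(s):
    # Locale-independent Unicode lowercasing: "İ" -> "i" + U+0307.
    return s.lower()


def lower_turkic(s):
    # Simplified stand-in for Turkic lowercasing: "İ" -> "i", "I" -> "ı".
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


def ci_equal(a, b, lower):
    # Model of a case-insensitive "=": compare the lowercased values.
    return lower(a) == lower(b)


def ci_in(value, candidates, lower):
    # Model of "value IN (candidates)" under the same comparison.
    return any(ci_equal(value, c, lower) for c in candidates)


# Both valid lowercase forms of "İ": plain "i" and "i" + COMBINING DOT ABOVE.
CANDIDATES = ["i", "i\u0307"]

for name, lower in [("default", lower_default), ("turkic", lower_turkic)]:
    print(name,
          ci_equal("i", "\u0130", lower),       # depends on the rule in use
          ci_in("\u0130", CANDIDATES, lower))   # True under both rules
```

In this model the "=" form gives a different answer depending on the
lowercasing rule, while the "IN" form gives the same answer under both,
which is what lets a single expected output cover both cases.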

I've attached a patch authored mostly by my colleague, Roman Zharkov, as
one possible fix.

Only versions 15+ are affected.

Any comments?

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

Attachment Content-Type Size
0001-Fix-citext_utf8-test-s-Turkish-I-with-ICU-collation-.patch text/x-patch 2.6 KB