Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, adam(dot)warland(at)infor(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
Date: 2025-12-02 17:29:06
Message-ID: 6387cb3e-aec8-41a0-acef-bacdbfb435db@iki.fi
Lists: pgsql-bugs

On 02/12/2025 18:36, Heikki Linnakangas wrote:
> On 02/12/2025 18:24, Laurenz Albe wrote:
>> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
>>> PostgreSQL version: 18.1
>>>
>>> When using a nondeterministic ICU collation, the replace() function
>>> fails to replace a substring when that substring appears at the end
>>> of the input string.
>>>
>>> Occurrences of the same substring earlier in the string are replaced
>>> normally.
>>>
>>> Specific collation used:
>>> create collation test_nondeterministic (
>>>      provider = icu,
>>>      locale = 'und-u-ks-level2',
>>>      deterministic = false
>>> )
>>>
>>> -- Replace final character under nondeterministic collation
>>> SELECT replace(
>>>      'testx' COLLATE "test_nondeterministic",
>>>      'x'     COLLATE "test_nondeterministic",
>>>      'y') AS res1;
>>
>> I can reproduce the problem, and the attached patch fixes it for me.
>
> +1, looks good to me. Let's also add a regression test for this.

I added a simple test for this, and I think the fix is still not quite
right. I added the following to the collate.icu.utf8 test:

CREATE TABLE test4nfd (a int, b text);
INSERT INTO test4nfd VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
UPDATE test4nfd SET b = normalize(b, nfd);
-- This shows why replace should be greedy. Otherwise, in the NFD
-- case, the match would stop before the decomposed accents, which
-- would leave the accents in the results.
SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4;
 a |  b   | replace
---+------+---------
 1 | cote | mate
 2 | côte | mate
 3 | coté | maté
 4 | côté | maté
(4 rows)

SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4nfd;
 a |  b   | replace
---+------+---------
 1 | cote | mate
 2 | côte | mate
 3 | coté | maté
 4 | côté | maté
(4 rows)

+-- Test for match at the end of the string. (We had a bug on that
+-- once)
+SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd;
+ a |  b   | replace
+---+------+---------
+ 1 | cote | coma
+ 2 | côte | coma
+ 3 | coté | coma
+ 4 | côté | coma
+(4 rows)
+

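In the added test query, the accent on the 'o' is stripped in rows 2
and 4 ('côte' and 'côté' both come out as 'coma'), even though the 'o'
is not part of the matched 'te'. That doesn't look correct.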
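
To make the NFD case concrete, here is a small Python sketch (not part
of the patch or of PostgreSQL) that shows how 'côte' is laid out after
normalize(b, nfd), and what the result would look like if the text
before the match were left alone. The names code_points, prefix, tail
and expected are made up for illustration, and treating "prefix kept
as-is" as the right behaviour is my reading of the problem, not
something the patch defines.

```python
import unicodedata

# 'côte' with a precomposed U+00F4, as stored in the test4 table.
nfc = "c\u00f4te"
# What normalize(b, nfd) stores in test4nfd: 'o' followed by U+0302.
nfd = unicodedata.normalize("NFD", nfc)


def code_points(s):
    """Render a string as a list of U+XXXX code point labels."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(code_points(nfc))
print(code_points(nfd))

# The trailing 'te' occupies the last two code points of the NFD string;
# the combining circumflex sits before it, attached to the 'o'.
prefix, tail = nfd[:-2], nfd[-2:]

# Assumption: a correct replace() leaves everything before the match
# untouched, so the combining accent on the 'o' survives.
expected = prefix + "ma"
print(unicodedata.normalize("NFC", expected))
```

The combining circumflex (U+0302) lies outside the matched 'te', so
under that assumption the row would read 'côma', not 'coma'.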

- Heikki
