Re: Use correct collation in pg_trgm

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Use correct collation in pg_trgm
Date: 2026-01-22 08:51:59
Message-ID: CALdSSPi1o=mYL-wWKfb-AZC+ozUKd-pRkHYqzi_yK9+vtT+V0Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 21 Jan 2026 at 20:36, David Geier <geidav(dot)pg(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
> CREATE EXSTENSION pg_trgm;
> CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
>
> postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> CREATE TABLE test(col TEXT COLLATE "turkish");
> INSERT INTO test VALUES ('ISTANBUL');
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 0.5
>
> If the database is initialized via initdb --locale="tr_TR.utf8", the
> output changes:
>
> postgres=# SELECT show_trgm('ISTANBUL');
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> and
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 1
>
> tr_TR.utf8 converts capital I to ı which is a multibyte character, while
> my default collation converts I to i.
>
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead DEFAULT_COLLATION_OID.
>
> The corresponding regression tests pass.
>
> [1]
> https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
>
> --
> David Geier

Hi!
LGTM

--
Best regards,
Kirill Reshke

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2026-01-22 09:01:38 Re: tablecmds: reject CLUSTER ON for partitioned tables earlier
Previous Message Ilyasov Ian 2026-01-22 08:37:56 RE: ReadRecentBuffer() doesn't scale well