| From: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
|---|---|
| To: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: Use correct collation in pg_trgm |
| Date: | 2026-01-22 08:51:59 |
| Message-ID: | CALdSSPi1o=mYL-wWKfb-AZC+ozUKd-pRkHYqzi_yK9+vtT+V0Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, 21 Jan 2026 at 20:36, David Geier <geidav(dot)pg(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
> CREATE EXSTENSION pg_trgm;
> CREATE COLLATION turkish (provider = libc, locale = 'tr_TR.utf8');
>
> postgres=# SELECT show_trgm('ISTANBUL' COLLATE "turkish");
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> CREATE TABLE test(col TEXT COLLATE "turkish");
> INSERT INTO test VALUES ('ISTANBUL');
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> ---------------------------------------------
> {" i"," is",anb,bul,ist,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 0.5
>
> If the database is initialized via initdb --locale="tr_TR.utf8", the
> output changes:
>
> postgres=# SELECT show_trgm('ISTANBUL');
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> and
>
> postgres=# select show_trgm(col) FROM test;
> show_trgm
> --------------------------------------------------------
> {0xf31e1a,0xfe581d,0x3efd30,anb,bul,nbu,sta,tan,"ul "}
>
> postgres=# SELECT similarity('ıstanbul' COLLATE "turkish", 'ISTANBUL'
> COLLATE "turkish");
> similarity
> ------------
> 1
>
> tr_TR.utf8 converts capital I to ı which is a multibyte character, while
> my default collation converts I to i.
>
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead DEFAULT_COLLATION_OID.
>
> The corresponding regression tests pass.
>
> [1]
> https://www.postgresql.org/message-id/e5dd01c6-c469-405d-aea2-feca0b2dc34d%40gmail.com
>
> --
> David Geier
Hi!
LGTM
--
Best regards,
Kirill Reshke
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chao Li | 2026-01-22 09:01:38 | Re: tablecmds: reject CLUSTER ON for partitioned tables earlier |
| Previous Message | Ilyasov Ian | 2026-01-22 08:37:56 | RE: ReadRecentBuffer() doesn't scale well |