Re: Reduce build times of pg_trgm GIN indexes

From:	David Geier <geidav(dot)pg(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Reduce build times of pg_trgm GIN indexes
Date:	2026-01-21 15:45:06
Message-ID:	66620ec7-0f81-4813-9cf1-b901a56efcc3@gmail.com
Lists:	pgsql-hackers

>> Oh, that's evil. I had tested that specifically. But it only worked
>> because the code in master uses str_tolower() with
>> DEFAULT_COLLATION_OID. So using a different locale like in the following
>> example does something different than when creating a database with the
>> same locale.
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III' COLLATE "tr_TR");
>> show_trgm
>> -------------------------
>> {" i"," ii","ii ",iii}
>> (1 row)
>>
>> But when using tr_TR as default locale of the database the following
>> happens:
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III');
>> show_trgm
>> ---------------------------------------
>> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>>
>> I'm wondering if that's intentional to begin with. Shouldn't the code
>> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
>> research to see how other index types handle locales.
>>
>> Coming back to the original problem: the lengthy comment at the top of
>> pg_locale_libc.c suggests that in some cases ASCII characters are
>> handled the pg_ascii_tolower() way for the default locale. See for
>> example tolower_libc_mb(). So a character by character conversion using
>> that function will yield a different result than strlower_libc_mb(). I'm
>> wondering why that is.
>
> Hmm, yeah, that feels funny. The trigram code predates per-column
> collation support, so I guess we never really thought through how it
> should interact with COLLATE clauses.

I've written a patch to fix that. See [1].

>> Anyways, we could limit the optimization to only kick in when the used
>> locale follows the same rules as pg_ascii_tolower(). We could test that
>> when creating the locale and store that info in pg_locale_struct.
>
> I think that's only possible for libc locales, which operate one
> character at a time. In ICU locales, lower-casing a character can depend
> on the surrounding characters, so you cannot just test the conversion of
> every ascii character individually. It would make sense for libc locales
> though, and I hope the ICU functions are a little faster anyway.
>
> Although, we probably should be using case-folding rather than lower-
> casing with ICU locales anyway. Case-folding is designed for string
> matching. It'd be a backwards-compatibility breaking change, though.

Oh, I wasn't aware of that. Doing it only for libc locales still seems
useful.
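
For illustration only, here is a minimal standalone sketch of what such a
per-locale check could look like for libc. It is not taken from any patch.
The helper name locale_lowers_ascii_plainly() is hypothetical, and
ascii_tolower() merely mirrors the rule pg_ascii_tolower() applies. It
assumes a POSIX.1-2008 libc (newlocale(), towlower_l()) and that the wide
character values of ASCII characters equal their ASCII codes.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <wctype.h>

/* Same rule as pg_ascii_tolower(): only 'A'..'Z' are changed. */
static unsigned char
ascii_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Hypothetical helper: returns true if the libc locale lower-cases every
 * ASCII character exactly like ascii_tolower() does. The result could be
 * computed once when the locale is created and cached alongside it.
 *
 * Assumption: wide character values 0..127 correspond to ASCII.
 */
static bool
locale_lowers_ascii_plainly(locale_t loc)
{
    assert(loc != (locale_t) 0);

    for (int c = 0; c < 128; c++)
    {
        wint_t expected = (wint_t) ascii_tolower((unsigned char) c);

        if (towlower_l((wint_t) c, loc) != expected)
            return false;
    }
    return true;
}
```

A caller would obtain the locale with newlocale(LC_ALL_MASK, name,
(locale_t) 0) and enable the fast path only if the helper returns true. The
"C" locale passes. A Turkish locale that lowers 'I' to dotless 'ı', as in
the lower() output quoted above, would be expected to fail the check.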

Good point with the casefolding. I'll look into that.

How do we usually go about such backwards-compatibility breaking
changes? Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?

[1]
https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com

--
David Geier

In response to

Re: Reduce build times of pg_trgm GIN indexes at 2026-01-12 22:10:03 from Heikki Linnakangas

Responses

Re: Reduce build times of pg_trgm GIN indexes at 2026-01-21 20:50:54 from Matthias van de Meent
