Re: Reduce build times of pg_trgm GIN indexes

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce build times of pg_trgm GIN indexes
Date: 2026-01-21 15:45:06
Message-ID: 66620ec7-0f81-4813-9cf1-b901a56efcc3@gmail.com
Lists: pgsql-hackers

>> Oh, that's evil. I had tested that specifically. But it only worked
>> because the code in master uses str_tolower() with
>> DEFAULT_COLLATION_OID. So using a different locale like in the following
>> example does something different than when creating a database with the
>> same locale.
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>>   lower
>> -------
>>   ııı
>>
>> postgres=# select show_trgm('III' COLLATE "tr_TR");
>>          show_trgm
>> -------------------------
>>   {"  i"," ii","ii ",iii}
>> (1 row)
>>
>> But when using tr_TR as default locale of the database the following
>> happens:
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>>   lower
>> -------
>>   ııı
>>
>> postgres=# select show_trgm('III');
>>                 show_trgm
>> ---------------------------------------
>>   {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>>
>> I'm wondering if that's intentional to begin with. Shouldn't the code
>> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
>> research to see how other index types handle locales.
>>
>> Coming back to the original problem: the lengthy comment at the top of
>> pg_locale_libc.c suggests that in some cases ASCII characters are
>> handled the pg_ascii_tolower() way for the default locale. See for
>> example tolower_libc_mb(). So a character by character conversion using
>> that function will yield a different result than strlower_libc_mb(). I'm
>> wondering why that is.
>
> Hmm, yeah, that feels funny. The trigram code predates per-column
> collation support, so I guess we never really thought through how it
> should interact with COLLATE clauses.

I've written a patch to fix that. See [1].

>> Anyways, we could limit the optimization to only kick in when the used
>> locale follows the same rules as pg_ascii_tolower(). We could test that
>> when creating the locale and store that info in pg_locale_struct.
>
> I think that's only possible for libc locales, which operate one
> character at a time. In ICU locales, lower-casing a character can depend
> on the surrounding characters, so you cannot just test the conversion of
> every ascii character individually. It would make sense for libc locales
> though, and I hope the ICU functions are a little faster anyway.
>
> Although, we probably should be using case-folding rather than
> lower-casing with ICU locales anyway. Case-folding is designed for
> string matching. It'd be a backwards-compatibility breaking change,
> though.

Oh, I wasn't aware of that. Doing it only for libc locales still seems
useful.
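
To make the libc-only idea concrete, here is a minimal sketch of the
per-locale check, assuming the locale's lower-casing can be called one
character at a time. All names are hypothetical and not taken from any
patch: ascii_tolower_sketch() stands in for pg_ascii_tolower(), and
turkish_like_tolower() is a toy function that only imitates a locale
mapping 'I' to something other than 'i'.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for pg_ascii_tolower(): lower-cases A-Z only. */
static int
ascii_tolower_sketch(int ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Returns true if lower_fn agrees with plain ASCII lower-casing for all
 * 128 ASCII code points.  This only makes sense for conversions that work
 * one character at a time (libc); it says nothing about context-dependent
 * conversions such as ICU's.
 */
static bool
lower_matches_ascii(int (*lower_fn) (int))
{
    for (int ch = 0; ch < 128; ch++)
    {
        if (lower_fn(ch) != ascii_tolower_sketch(ch))
            return false;
    }
    return true;
}

/* Toy conversion that breaks the rule: 'I' becomes dotless i (U+0131). */
static int
turkish_like_tolower(int ch)
{
    if (ch == 'I')
        return 0x131;
    return ascii_tolower_sketch(ch);
}

/* Example usage: only the first conversion would qualify for a fast path. */
static void
sketch_self_check(void)
{
    assert(lower_matches_ascii(ascii_tolower_sketch));
    assert(!lower_matches_ascii(turkish_like_tolower));
}
```

The result of such a check could be computed once when the locale is
created and cached, as suggested above for pg_locale_struct. Where
exactly that flag would live is left open here.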

Good point about case-folding. I'll look into that.

How do we usually go about such backwards-compatibility breaking
changes? Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?

[1]
https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com

--
David Geier
