| From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
|---|---|
| To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Reduce build times of pg_trgm GIN indexes |
| Date: | 2026-01-21 15:45:06 |
| Message-ID: | 66620ec7-0f81-4813-9cf1-b901a56efcc3@gmail.com |
| Lists: | pgsql-hackers |
>> Oh, that's evil. I had tested that specifically. But it only worked
>> because the code in master uses str_tolower() with
>> DEFAULT_COLLATION_OID. So using a different locale like in the following
>> example does something different than when creating a database with the
>> same locale.
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III' COLLATE "tr_TR");
>> show_trgm
>> -------------------------
>> {" i"," ii","ii ",iii}
>> (1 row)
>>
>> But when using tr_TR as default locale of the database the following
>> happens:
>>
>> postgres=# select lower('III' COLLATE "tr_TR");
>> lower
>> -------
>> ııı
>>
>> postgres=# select show_trgm('III');
>> show_trgm
>> ---------------------------------------
>> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>>
>> I'm wondering if that's intentional to begin with. Shouldn't the code
>> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
>> research to see how other index types handle locales.
>>
>> Coming back to the original problem: the lengthy comment at the top of
>> pg_locale_libc.c suggests that in some cases ASCII characters are
>> handled the pg_ascii_tolower() way for the default locale. See for
>> example tolower_libc_mb(). So a character by character conversion using
>> that function will yield a different result than strlower_libc_mb(). I'm
>> wondering why that is.
>
> Hmm, yeah, that feels funny. The trigram code predates per-column
> collation support, so I guess we never really thought through how it
> should interact with COLLATE clauses.
I've written a patch to fix that. See [1].
>> Anyways, we could limit the optimization to only kick in when the used
>> locale follows the same rules as pg_ascii_tolower(). We could test that
>> when creating the locale and store that info in pg_locale_struct.
>
> I think that's only possible for libc locales, which operate one
> character at a time. In ICU locales, lower-casing a character can depend
> on the surrounding characters, so you cannot just test the conversion of
> every ascii character individually. It would make sense for libc locales
> though, and I hope the ICU functions are a little faster anyway.
>
> Although, we probably should be using case-folding rather than lower-
> casing with ICU locales anyway. Case-folding is designed for string
> matching. It'd be a backwards-compatibility breaking change, though.
Oh, I wasn't aware of that. Doing it only for libc locales still seems
useful.
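
To make the libc-only idea concrete, here is a rough standalone sketch of
the check. It is not taken from any patch: it uses plain POSIX
newlocale()/tolower_l() instead of PostgreSQL's locale wrappers, and the
names ascii_tolower_sketch(), locale_lowers_ascii_like_c() and
named_locale_lowers_ascii_like_c() are made up for illustration. The first
one only mirrors what I assume pg_ascii_tolower() does for A-Z. In the real
code the result would presumably be computed once when the locale is
created and cached in pg_locale_struct.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_ascii_tolower(): lower-cases A-Z only and
 * leaves every other byte untouched.
 */
static unsigned char
ascii_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Returns true if the given libc locale lower-cases every 7-bit ASCII
 * character exactly like the ASCII-only rule above. This only makes sense
 * for libc locales, which convert one character at a time.
 */
static bool
locale_lowers_ascii_like_c(locale_t loc)
{
    for (int c = 0; c < 128; c++)
    {
        if (tolower_l(c, loc) != (int) ascii_tolower_sketch((unsigned char) c))
            return false;
    }
    return true;
}

/*
 * Convenience wrapper that looks the locale up by name. Returns false if
 * the locale is not installed, i.e. "don't use the fast path".
 */
static bool
named_locale_lowers_ascii_like_c(const char *name)
{
    locale_t    loc;
    bool        result;

    assert(name != NULL);

    loc = newlocale(LC_ALL_MASK, name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return false;

    result = locale_lowers_ascii_like_c(loc);
    freelocale(loc);
    return result;
}
```

Calling named_locale_lowers_ascii_like_c("C") should return true. For a
Turkish locale I would expect false because of the 'I' mapping, but that
depends on the platform's locale data, so take it as an expectation rather
than a guarantee.
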
Good point about case folding. I'll look into that.
How do we usually go about such backwards-compatibility breaking
changes? Could we have pg_upgrade reindex all GIN indexes? Would that be
acceptable?
[1]
https://www.postgresql.org/message-id/flat/db087c3e-230e-4119-8a03-8b5d74956bc2%40gmail.com
--
David Geier