| From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
|---|---|
| To: | David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Reduce build times of pg_trgm GIN indexes |
| Date: | 2026-01-12 22:10:03 |
| Message-ID: | 2e11134f-02c3-43da-8c39-fb520a1a251d@iki.fi |
| Lists: | pgsql-hackers |

On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
>
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So specifying a different locale with COLLATE,
> as in the following example, behaves differently from creating a
> database with that locale as its default.
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III' COLLATE "tr_TR");
> show_trgm
> -------------------------
> {" i"," ii","ii ",iii}
> (1 row)
>
> But when using tr_TR as default locale of the database the following
> happens:
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III');
> show_trgm
> ---------------------------------------
> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
>
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.

> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ASCII character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.

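To make the libc idea concrete, here is a minimal sketch of such a check, not a proposed patch. It assumes the locale under test is the process's active LC_CTYPE and uses plain tolower(); real code would more likely go through a locale_t and tolower_l(), and would cache the result in a field of pg_locale_struct that does not exist today. The names ascii_tolower_sketch() and current_locale_lowers_ascii_plainly() are hypothetical, and the former is only a stand-in for pg_ascii_tolower().

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for pg_ascii_tolower(): only 'A'..'Z' are changed,
 * every other byte is returned as is.
 */
static unsigned char
ascii_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Returns true if tolower() under the currently active LC_CTYPE maps each
 * of the 128 ASCII code points exactly like the ASCII-only rule above.
 * This per-character test is only meaningful for libc-style locales that
 * convert one character at a time.
 */
static bool
current_locale_lowers_ascii_plainly(void)
{
    for (int c = 0; c < 128; c++)
    {
        if (tolower(c) != ascii_tolower_sketch((unsigned char) c))
            return false;
    }
    return true;
}
```

A caller would run the check once after selecting the locale, for example after setlocale(LC_CTYPE, "C"), and only enable the ASCII fast path if it returns true. A locale in which 'I' does not lower-case to 'i', as in the tr_TR examples quoted above, would be expected to fail the check and stay on the slow path.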
Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility-breaking change, though.

- Heikki