Re: Reduce build times of pg_trgm GIN indexes

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce build times of pg_trgm GIN indexes
Date: 2026-01-12 22:10:03
Message-ID: 2e11134f-02c3-43da-8c39-fb520a1a251d@iki.fi
Lists: pgsql-hackers

On 09/01/2026 14:06, David Geier wrote:
> On 06.01.2026 18:00, Heikki Linnakangas wrote:
>> On 05/01/2026 17:01, David Geier wrote:
>>> v1-0008-Add-ASCII-fastpath-to-generate_trgm_only.patch: Typically lots
>>> of text is actually ASCII. Hence, we provide a fast path for this case
>>> which is exercised if the MSB of the current character is unset.
>>
>> This uses pg_ascii_tolower() for ASCII characters when built with
>> IGNORECASE. I don't think that's correct if the proper collation
>> would do something more complicated than what pg_ascii_tolower() does.
>
> Oh, that's evil. I had tested that specifically. But it only worked
> because the code in master uses str_tolower() with
> DEFAULT_COLLATION_OID. So using a different locale like in the following
> example does something different than when creating a database with the
> same locale.
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III' COLLATE "tr_TR");
> show_trgm
> -------------------------
> {" i"," ii","ii ",iii}
> (1 row)
>
> But when using tr_TR as default locale of the database the following
> happens:
>
> postgres=# select lower('III' COLLATE "tr_TR");
> lower
> -------
> ııı
>
> postgres=# select show_trgm('III');
> show_trgm
> ---------------------------------------
> {0xbbd8dd,0xf26fab,0xf31e1a,0x2af4f1}
>
> I'm wondering if that's intentional to begin with. Shouldn't the code
> instead pass PG_GET_COLLATION() to str_tolower()? Might require some
> research to see how other index types handle locales.
>
> Coming back to the original problem: the lengthy comment at the top of
> pg_locale_libc.c suggests that in some cases ASCII characters are
> handled the pg_ascii_tolower() way for the default locale. See for
> example tolower_libc_mb(). So a character by character conversion using
> that function will yield a different result than strlower_libc_mb(). I'm
> wondering why that is.

Hmm, yeah, that feels funny. The trigram code predates per-column
collation support, so I guess we never really thought through how it
should interact with COLLATE clauses.

> Anyways, we could limit the optimization to only kick in when the used
> locale follows the same rules as pg_ascii_tolower(). We could test that
> when creating the locale and store that info in pg_locale_struct.

I think that's only possible for libc locales, which operate one
character at a time. In ICU locales, lower-casing a character can depend
on the surrounding characters, so you cannot just test the conversion of
every ASCII character individually. It would make sense for libc locales
though, and I hope the ICU functions are a little faster anyway.
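
To make the libc-only check concrete, here is a minimal standalone sketch
of what "test it once when creating the locale" could look like. It is not
taken from the patch: ascii_tolower() only mimics what pg_ascii_tolower()
does, locale_lowercases_ascii_like_c() is a hypothetical helper name, and
it switches the process locale with setlocale() purely to stay portable,
whereas the backend would presumably use the locale_t it already holds.

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* ASCII-only lower-casing, mimicking the behavior of pg_ascii_tolower(). */
static unsigned char
ascii_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Hypothetical helper: returns true if the libc locale named by
 * locale_name maps every 7-bit character exactly like ascii_tolower().
 * Only meaningful for libc, which converts one character at a time.
 */
static bool
locale_lowercases_ascii_like_c(const char *locale_name)
{
    char        saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);
    bool        ok = true;

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return false;
    strcpy(saved, cur);

    /* Locale not available: be conservative and disable the fast path. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return false;

    for (int c = 0; c < 128; c++)
    {
        if (tolower(c) != ascii_tolower((unsigned char) c))
        {
            ok = false;
            break;
        }
    }

    setlocale(LC_CTYPE, saved);
    return ok;
}
```

The result would be computed once per locale and cached (for example in a
flag in pg_locale_struct, as suggested above), so the per-character fast
path only has to test that flag. For ICU locales the flag would simply
stay false.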

Although, we probably should be using case-folding rather than
lower-casing with ICU locales anyway. Case-folding is designed for
string matching. It'd be a backwards-compatibility breaking change, though.

- Heikki
