Re: Wrong results with equality search using trigram index and non-deterministic collation

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Wrong results with equality search using trigram index and non-deterministic collation
Date: 2026-05-07 13:05:29
Message-ID: 79e467049c97b0b96f1b5d714644a1f09cef1c54.camel@cybertec.at
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2026-05-04 at 13:53 +0200, David Geier wrote:
> >
> Attached patch makes your case work, including the % case. It builds on
> top of the other patches from [1] that makes pg_trgm use the inferred
> collation trigram extraction.
>
> Instead of using btint4cmp() to compare trigrams, the patch uses a
> collation-aware string comparison function.

Thanks! I tried your patch, and it does indeed fix the bug I reported.

I looked at your patch, and it is pretty straightforward.
("git am" complained about an empty line at the end of
"pg_trgm--1.6--1.7.sql", but that's merely cosmetic.)

> This is just a PoC. I haven't given much thought to the details but e.g.
> when three consecutive characters exceed 3 bytes then compact_trigram()
> uses a truncated 32-bit hash value as trigram instead. Such trigrams
> won't work in all cases. We could omit them from the query string but
> for languages where the majority of trigrams are hashed or where the
> query string consists of only a few trigrams, the look-up performance
> would suffer.

Does that mean that you could end up with wrong results (which would not
be acceptable), or that you could end up with false positives that
later get eliminated by the recheck (which would be fine)?

I am worried about collations that have digraphs - the letters would be
split when trigrams are formed, and that might cause trouble.

And indeed, I am able to break it with a "quadrigraph":

CREATE COLLATION crazy (
PROVIDER = icu,
LOCALE = 'da-DK',
DETERMINISTIC = FALSE,
RULES = '& a = zzzz'
);

CREATE TABLE boom2 (id integer PRIMARY KEY, t text COLLATE crazy);

INSERT INTO boom2 VALUES (1, 'myad'), (2, 'myzzzzd');

SELECT * FROM boom2 WHERE t = 'myad';

id │ t
════╪═════════
1 │ myad
2 │ myzzzzd
(2 rows)

CREATE INDEX trgm_idx2 ON boom2 USING gin (t gin_trgm_ops);

SET enable_seqscan = off;

SELECT * FROM boom2 WHERE t = 'myad';

id │ t
════╪══════
1 │ myad
(1 row)

> I guess better would be using a collation-aware hash function that maps
> different values that compare equal to the same hash value. hashtext()
> does that already. The new comparison function would then have to
> distinguish between plain text trigrams and hash trigrams.
> Alternatively, we could store all trigrams as hashes but that would
> break functions such as show_trgm().

But that would probably not fix the above problem, right?

My initial thought about this bug was to just not consider a trigram
index if a non-deterministic collation is involved, but I can't see
how that could be done in the planner.

Still, I think that the first two patches of your set do the right thing.

Yours,
Laurenz Albe

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ayush Tiwari 2026-05-07 13:06:37 Re: [PATCH] Fix duplicate errmsg in ALTER TABLE SPLIT PARTITION
Previous Message Tomas Vondra 2026-05-07 12:54:18 Re: Why clearing the VM doesn't require registering vm buffer in wal record