Re: Use correct collation in pg_trgm

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Use correct collation in pg_trgm
Date: 2026-03-27 10:56:43
Message-ID: e4ab080d-b637-472e-ba69-bdfc413084ad@gmail.com
Lists: pgsql-hackers

On 26.03.2026 19:26, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>> On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
>>> I agree. That is inconsistent. But if anything, shouldn't we change
>>> tsvector/tsquery to as well adhere to the inferred collation?
>
>> I am not sure either way.
>> It's easy to specify a COLLATE clause to affect the interpretation of
>> the input. But once you parse the inputs into a stored value, you can't
>> later reinterpret those values by specifying a COLLATE clause. The
>> parsing already happened and the original input string was lost.
>> You can end up with a table full of values, some of which were parsed
>> with one set of semantics, and others parsed with a different set of
>> semantics. That may make sense or it may just cause confusion. It's
>> tough for me to say.
>
> The rule that text search goes by is that it's okay to be a bit
> fuzzy about this because people are usually looking for approximate
> matches, so that even if you have sets of lexemes that were extracted
> under slightly different parsing rules you can probably still find

tsquery allows starts-with queries, equivalent to LIKE 'foo%', via
to_tsquery('foo:*'). These two would then also behave differently.
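
As a minimal sketch of the two query styles being compared (assuming a
table like the test table further down, and glossing over the fact that
a tsquery prefix matches the start of any lexeme rather than the start
of the whole string):

```sql
-- Prefix match on lexemes via tsquery
SELECT * FROM test
WHERE to_tsvector('simple', col) @@ to_tsquery('simple', 'foo:*');

-- Prefix match on the whole string via LIKE
SELECT * FROM test WHERE col LIKE 'foo%';
```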

Can you see any good reason against using the inferred collation in
tsquery / tsvector?

> what you want. While that argument still works for pg_trgm's original
> "similarity" functions, it falls flat for the LIKE/ILIKE/regex index
> support functionality: people will be justifiably unhappy if the index
> doesn't find the exact same matches that a seqscan-and-filter would.

Agreed. That was also one of the motivations to change it.

> I've not experimented, but I rather imagine that things are already
> buggy as heck, in that optimizing a LIKE or regex expression that's
> got collation A applied to it into an indexscan on a pg_trgm index
> made with collation B will not work if different trigrams get
> extracted. I think we have to insist that the index collation match
> the query. Once we've done that, the concern about making a change
> like this seems less: you will not get wrong answers, rather the
> planner will refuse to use an incompatible index.

I thought that already happens. In the following example, no index scan
is used even though sequential scans are disabled. From what I can see,
IndexCollMatchesExprColl() takes care of that.

CREATE EXTENSION pg_trgm;
CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
CREATE INDEX ON test USING GIN(col gin_trgm_ops);
SET enable_seqscan = FALSE;

EXPLAIN SELECT * FROM test WHERE col LIKE '%test%' COLLATE "C";
                                QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on test  (cost=10000000000.00..10000000001.01 rows=1 width=32)
   Filter: (col ~~ '%test%'::text COLLATE "C")

If you have other cases in mind, pointers are appreciated.

--
David Geier
