| From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
|---|---|
| To: | David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: Use correct collation in pg_trgm |
| Date: | 2026-03-24 23:55:46 |
| Message-ID: | 2c15502fd399128ee27fbe1a305e006780159f66.camel@j-davis.com |
| Lists: | pgsql-hackers |

On Wed, 2026-01-21 at 16:36 +0100, David Geier wrote:
> Hi hackers,
>
> In thread [1] we found that pg_trgm always uses DEFAULT_COLLATION_OID
> for converting trigrams to lower-case. Here are some examples where
> today the collation is ignored:
>
...
> The attached patch attempts to fix that. I grepped for all occurrences
> of DEFAULT_COLLATION_OID in contrib/pg_trgm and use the function's
> collation OID instead of DEFAULT_COLLATION_OID.

Hi,

Thank you for working on this.

This area is a bit awkward conceptually. The case you found is not
about the *sort order* of the values; it's about the casing semantics.
We mix those two concepts into a single "collation oid" that determines
both sort order and casing semantics (and pattern matching semantics,
too).

LOWER() and UPPER() take the casing semantics from the inferred
collation, so that's a good argument that you're doing the right thing
here.
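
To illustrate why casing semantics (rather than sort order) matter for trigrams, here is a toy Python sketch. It is not pg_trgm's code: `lower_default`, `lower_turkish`, `trigrams`, and `similarity` are hypothetical helpers, the Turkish rule is hand-written instead of coming from a real collation provider, and the padding (two leading spaces, one trailing space) only approximates pg_trgm's extraction for a single word.

```python
from typing import Callable, Set


def lower_default(s: str) -> str:
    """Locale-independent lowercasing; stands in for a default collation."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Toy Turkish-style lowercasing: 'I' -> dotless 'ı', 'İ' -> 'i'.

    Assumption: a real implementation would ask the collation provider
    instead of hard-coding these two mappings.
    """
    return s.replace("I", "ı").replace("İ", "i").lower()


def trigrams(word: str, lower: Callable[[str], str]) -> Set[str]:
    """Lowercase one word with the given rule, pad it, and cut it into trigrams."""
    padded = "  " + lower(word) + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str, lower: Callable[[str], str]) -> float:
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a, lower), trigrams(b, lower)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Default casing folds 'I' to 'i', so "ID" and "id" share every trigram.
    print(similarity("ID", "id", lower_default))  # 1.0
    # Turkish-style casing folds 'I' to 'ı', so no trigram is shared.
    print(similarity("ID", "id", lower_turkish))  # 0.0
```

The two calls compare the same pair of strings and differ only in the casing rule, which is the part that the choice between the inferred collation and DEFAULT_COLLATION_OID would control.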

But full text search does not; it uses DEFAULT_COLLATION_OID for
parsing the input. That sort of makes sense, because tsvector/tsquery
don't have a collatable sort order -- it's more about the parsing
semantics to create the values in the first place, not about how the
tsvector/tsquery values are sorted.

So that leaves me wondering: why would pg_trgm use the inferred
collation and tsvector/tsquery use DEFAULT_COLLATION_OID? They seem
conceptually similar, and the only real difference I see is that
tsvector/tsquery are types and pg_trgm is a set of functions.

Note that I made some changes here recently: full text search and ltree
used to use libc unconditionally or a mix of libc and
DEFAULT_COLLATION_OID; that was clearly wrong and I changed it to
consistently use DEFAULT_COLLATION_OID. But I didn't resolve the
conceptual problem of whether we should use the inferred collation (as
you suggest) or not.

Regards,
Jeff Davis