Re: insensitive collations

From: Jim Finnerty <jfinnert(at)amazon(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: insensitive collations
Date: 2021-03-24 20:41:38
Message-ID: 1616618498263-0.post@n3.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

For a UTF8 encoded, case-insensitive (nondeterministic), accent-sensitive ICU
collation, a LIKE predicate can be used with a small transformation of the
predicate, and the pattern can contain multi-byte characters:

from:

SELECT * FROM locations WHERE location LIKE 'midi-Pyrené%';
-- ERROR: nondeterministic collations are not supported for LIKE

to:

SELECT * FROM locations WHERE lower(location) COLLATE "C" LIKE
lower('midi-Pyrené%');

and if there is an index defined as follows:

CREATE INDEX location_LOWER_C_index ON
locations (LOWER(location) COLLATE "C");

then the LIKE predicate above performs the desired CI_AS evaluation and also
exploits the index:

EXPLAIN VERBOSE SELECT * FROM locations WHERE LOWER(location) COLLATE "C"
LIKE LOWER('midi-Pyrené%');

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using location_LOWER_C_index on public.locations
(cost=0.13..8.16 rows=1 width=18)
Output: location
Index Cond: (((lower((locations.location)::text))::text >=
'midi-pyren?'::text) AND ((lower((locations.location)::text))::text <
'midi-pyren?'::text))
Filter: ((lower((locations.location)::text))::text ~~
'midi-pyren?%'::text)
(4 rows)

It turns out that CI_AS represents the vast majority (more than 99.9%) of
nondeterministic collations that we are seeing in babelfish, because
SQL_Latin1_General_CP1_CI_AS is the default collation in SQL Server.

Currently nondeterministic collations are disabled at the database level.
The cited reason was because of the lack of LIKE support and because certain
catalog views use LIKE. That may still need to be a limitation if those
LIKE predicates currently have an index exploitation unless we can create a
LOWER(col) COLLATE "C" index on the catalog. Which catalog views were these
that had the problem?

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-03-24 20:46:45 Re: pg_amcheck contrib application
Previous Message Peter Eisentraut 2021-03-24 19:50:59 Re: truncating timestamps on arbitrary intervals