ICU collation support is incomplete for replacing citext in 10.beta4

From: Rusty Conover <Rusty(dot)Conover(at)twosigma(dot)com>
To: "'pgsql-bugs(at)postgresql(dot)org'" <pgsql-bugs(at)postgresql(dot)org>
Subject: ICU collation support is incomplete for replacing citext in 10.beta4
Date: 2017-09-07 18:05:11
Message-ID: 7f0120e8945c4befac964777d31912d7@exmbdft5.ad.twosigma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The support for ICU provided collations in PostgreSQL 10.beta4 adds the ability to have case insensitive collations. Which is a great thing, because citext can be very slow!

Collations in ICU allow a number of parameters to be specified that control their behavior, one of those parameters is called strength.

Strength is explained here:

http://userguide.icu-project.org/collation/concepts#TOC-Comparison-Levels

To create a collation that uses a strength level of primary (one that only compares base characters) use this:

create collation test_ncs (locale = 'und(at)colStrength=primary', provider = 'icu');

Sadly, the collation is successfully created but PostgreSQL doesn't respect the result of ICU's collation functionality.

This can be demonstrated by this short SQL:

-- test_ncs is non case sensitive
create table test (v text collate "test_ncs");
insert into test (v) values ('abc');
insert into test (v) values ('aBc');

-- Should return two rows
select * from test where v = 'ABC';

-- doesn't return any rows.
v
---
(0 rows)

Reviewing the implementation of the text, varchar, bpchar and char data types I found some problems that are likely the cause.

## String data type implementation notes

texteq(), textne() - These functions call memcmp() if the lengths of two text Datums don't differ. Strings could still be considered to be equal by the collation even if in memory their representations differ.

Calls to memcmp():

https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/varlena.c#L1666
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/varlena.c#L1695

These functions should both be changed to call text_cmp with the current collation from the function invocation information.

varstr_cmp(), varstrfastcmp_locale() - Both of these functions compare the strings using strcmp() if the collation returns a value of zero. This prevents the output of the collation function from being respected with regards to equality. More plainly, strcmp() is always case sensitive, the collation we previously defined above is not.

Call to strcmp():

https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/varlena.c#L1601

This opens a line of inquiry, should ICU's collation function and other collation providers be used for string equality? It seems we should respect the users intent when the collation was specified for the expression or the column.

I've prepared some diffs to resolve the issues around the code not trusting the collation's equality for the varchar, text and char data
types. I will post them if requested.

## Hashing Strings Using Collations

An ICU collation may say that two strings are equal when strcmp() or memcmp() says they are not equal. The hash values for strings should incorporate the use of the collation as well. Because if a collation considers a pair of strings to be equal they should hash to the same
value. If this is not the case strings that are equal may be in different hash buckets so that when traversing the contents of a hash
table the equivalent string values may not be enumerated together.

ICU offers an API where strings can be converted to what ICU calls Sort Keys. Sort Key representations are then able to be passed to the hash_any() function to produce hash values that respect the collation's beliefs.

ICU's description of Sort Keys:

http://userguide.icu-project.org/collation/architecture#TOC-Sort-Keys

By requiring a collation to be used when hashing a value, it does complicate the query's execution. For an example look at these statements:

--
create table test1 (v text collate "C")
select v from test1 except select v from test1;
--

In execTuplesMatch() the equality function is called using FunctionCall2().

https://github.com/postgres/postgres/blob/master/src/backend/executor/execGrouping.c#L115

FunctionCall2() passes InvalidOid for the collation Oid to use to perform the equality operation.

https://github.com/postgres/postgres/blob/master/src/include/fmgr.h#L605

With my changes to the implementation of the string data types, the code raises an error that it attempting to compare two strings without a valid collation Oid.

There seems to be a need to extend the various functionality and structures for comparing tuples, hash tables for aggregates and the
SetOp nodes. Am I incorrect?

Would it be reasonable to assume that a collation would be able to be specified for a GROUP BY like it can be specified for an ORDER BY? It
does appear that by allowing collations to be specified as an expression this is possible:

select v collate "C" from test1 group by (v collate "C") order by (v collate "C") collate "POSIX";

For instance, you could group without respect to case when the original collation of the column is case sensitive.

Before continuing efforts into resolving these collation challenges, I'd like to know if this is considered a bug, or simply out of scope of PostgreSQL 10? There are additional difficulties using collations when it comes to the functionality of string functions such as LIKE, position() and regular expressions.

Thank you,

Rusty

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2017-09-07 18:43:33 Re: ICU collation support is incomplete for replacing citext in 10.beta4
Previous Message Rusty Conover 2017-09-07 18:01:26 ICU collation support is incomplete for replacing citext in 10.beta4