Re: insensitive collations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: insensitive collations
Date: 2019-02-21 02:17:45
Message-ID: CAH2-Wzm2DhYvHYzOqD49rdkQ_MJ-Vhqvv3_noow6ui+R35goZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> Another patch to address merge conflicts.

Some remarks on this:

* Your draft commit message says:

> This patch makes changes in three areas:
>
> - CREATE COLLATION DDL changes and system catalog changes to support
> this new flag.
>
> - Many executor nodes and auxiliary code are extended to track
> collations. Previously, this code would just throw away collation
> information, because the eventually-called user-defined functions
> didn't use it since they only cared about equality, which didn't
> need collation information.
>
> - String data type functions that do equality comparisons and hashing
> are changed to take the (non-)deterministic flag into account. For
> comparison, this just means skipping various shortcuts and tie
> breakers that use byte-wise comparison. For hashing, we first need
> to convert the input string to a canonical "sort key" using the ICU
> analogue of strxfrm().

I wonder if it would be better to break this into distinct commits?

* Why is support for non-deterministic comparisons limited to the ICU
provider? If that's the only case that could possibly be affected,
then why did we ever add the varstrcmp() tie-breaker call to strcmp()
in the first place? The behavior described in the commit message of
bugfix commit 656beff5 describes a case where Hungarian text caused
index corruption by being strcoll()-wise equal but not bitwise equal.
Besides, I think that you can vendor your own case insensitive
collation with glibc, since it's based on UCA.

This restriction feels like it's actually a kludge to work around the
fact that database-wide collations have this foreign key related
restriction in your patch:

> @@ -2901,11 +2921,20 @@ ri_AttributesEqual(Oid eq_opr, Oid typeid,
> }
>
> /*
> - * Apply the comparison operator. We assume it doesn't care about
> - * collations.
> - */
> - return DatumGetBool(FunctionCall2(&entry->eq_opr_finfo,
> - oldvalue, newvalue));
> + * Apply the comparison operator.
> + *
> + * Note: This function is part of a call stack that determines whether an
> + * update to a row is significant enough that it needs checking or action
> + * on the other side of a foreign-key constraint. Therefore, the
> + * comparison here would need to be done with the collation of the *other*
> + * table. For simplicity (e.g., we might not even have the other table
> + * open), we'll just use the default collation here, which could lead to
> + * some false negatives. All this would break if we ever allow
> + * database-wide collations to be nondeterministic.
> + */
> + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
> + DEFAULT_COLLATION_OID,
> + oldvalue, newvalue));
> }

The existing restriction on ICU collations (that they cannot be
database-wide) side-steps the issue.

* Can said restriction somehow be lifted? That seems like it would be
a lot cleaner.

* Why have you disable this optimization?:

> /* Fast pre-check for equality, as discussed in varstr_cmp() */
> - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> + if ((!sss->locale || sss->locale->deterministic) &&
> + len1 == len2 && memcmp(a1p, a2p, len1) == 0)

I don't see why this is necessary. A non-deterministic collation
cannot indicate that bitwise identical strings are non-equal.

* Perhaps you should add a "Tip" referencing the feature to the
contrib/citext documentation.
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Chapman Flack 2019-02-21 02:19:26 Re: proposal: variadic argument support for least, greatest function
Previous Message Tom Lane 2019-02-21 02:11:06 Re: NOT IN subquery optimization