Re: insensitive collations

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: insensitive collations
Date: 2019-02-21 08:36:12
Message-ID: b6b28f54-a667-87c2-870f-c9f07dd796ee@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2019-02-21 03:17, Peter Geoghegan wrote:
> I wonder if it would be better to break this into distinct commits?

I thought about that. Especially the planner/executor changes could be
done separately, sort of as a way to address the thread
"ExecBuildGroupingEqual versus collations". But I'm not sure if they
would have good test coverage on their own. I can work on this if
people think this would be useful.

> * Why is support for non-deterministic comparisons limited to the ICU
> provider? If that's the only case that could possibly be affected,
> then why did we ever add the varstrcmp() tie-breaker call to strcmp()
> in the first place? The behavior described in the commit message of
> bugfix commit 656beff5 describes a case where Hungarian text caused
> index corruption by being strcoll()-wise equal but not bitwise equal.
> Besides, I think that you can vendor your own case insensitive
> collation with glibc, since it's based on UCA.

The original test case (described here:
https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us)
no longer works, so it was probably fixed on the glibc side. The git
log of the hu_HU locale definition shows that it has been "fixed" in
major ways several times over the years, so that's plausible.

I tried reproducing some more practical scenarios involving combining
diacritics, but glibc apparently doesn't believe strings in different
normal forms are equal. At that point I gave up because this doesn't
seem worthwhile to support.

Moreover, I think allowing this would require a "trusted" strxfrm(),
which is currently disabled.

>> + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
>> + DEFAULT_COLLATION_OID,
>> + oldvalue, newvalue));
>> }
>
> The existing restriction on ICU collations (that they cannot be
> database-wide) side-steps the issue.
>
> * Can said restriction somehow be lifted? That seems like it would be
> a lot cleaner.

Lift what restriction? That ICU collations cannot be database-wide?
Sure that would be nice, but it's a separate project. Even then, I'm
not sure that we would allow a database-wide collation to be
nondeterministic. That would for example disallow the use of LIKE,
which would be weird. In any case, the above issue can be addressed
then. I think it's not worth complicating this right now.

> * Why have you disable this optimization?:
>
>> /* Fast pre-check for equality, as discussed in varstr_cmp() */
>> - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>> + if ((!sss->locale || sss->locale->deterministic) &&
>> + len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>
> I don't see why this is necessary. A non-deterministic collation
> cannot indicate that bitwise identical strings are non-equal.

Right, I went too far there.

> * Perhaps you should add a "Tip" referencing the feature to the
> contrib/citext documentation.

Good idea.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message mitani 2019-02-21 08:44:00 [PATCH] Allow transaction to partition table using FDW
Previous Message Jamison, Kirk 2019-02-21 08:20:55 RE: Timeout parameters