Re: Nondeterministic collations vs. text_pattern_ops

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Nondeterministic collations vs. text_pattern_ops
Date: 2019-09-17 15:17:32
Message-ID: 3615.1568733452@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> writes:
> On 2019-09-17 01:13, Tom Lane wrote:
>> Whilst poking at the leakproofness-of-texteq issue, I realized
>> that there's an independent problem caused by the nondeterminism
>> patch. To wit, that the text_pattern_ops btree opclass uses
>> texteq as its equality operator, even though that operator is
>> no longer guaranteed to be bitwise equality. That means that
>> depending on which collation happens to get attached to the
>> operator, equality might be inconsistent with the other members
>> of the opclass, leading to who-knows-what bad results.

> You can't create a text_pattern_ops index on a column with
> nondeterministic collation:

> create collation c1 (provider = icu, locale = 'und', deterministic = false);
> create table t1 (a int, b text collate c1);
> create index on t1 (b text_pattern_ops);
> ERROR: nondeterministic collations are not supported for operator class
> "text_pattern_ops"

Oh! I'd seen that error message, but not realized that it'd get
thrown during index creation.

> There is some discussion in internal_text_pattern_compare().

I don't much like doing it that way: looking up the determinism property
of the collation over again in every single comparison seems pretty
expensive, plus the operator is way exceeding its actual knowledge
of the call context by throwing an error phrased that way.

> Are there other cases we need to consider?

I think that disallowing indexes with this combination of opclass and
collation may actually be sufficient. A query can request equality
using any collation, but it won't get matched to an index with a
different collation, so I think we're safe against index-related
problems if we have that restriction.

AFAIR, the only other place in the system where non-default opclasses
can be invoked is ORDER BY. Somebody could write

ORDER BY textcol COLLATE "nondeterm" USING ~<~

However, I think we're actually okay on that one, because although
the equality opclass member is out of sync with the rest, it won't
get consulted during a sort. internal_text_pattern_compare will
throw an error for this, but I don't believe it actually needs to.

My recommendation is to get rid of the run-time checks and instead
put a hack like this into DefineIndex or some minion thereof:

if ((opclass == TEXT_PATTERN_BTREE_CLASS_OID ||
opclass == VARCHAR_PATTERN_BTREE_CLASS_OID ||
opclass == BPCHAR_PATTERN_BTREE_CLASS_OID) &&
!get_collation_isdeterministic(collid))
ereport(ERROR, ...);

Hard-wiring that is ugly; maybe someday we'd wish to expose "doesn't
allow nondeterminism" as a more generally-available opclass property.
But without some other examples that have a need for it, I think it's
not worth the work to create infrastructure for that. It's not like
there are no other hard-wired legacy behaviors in DefineIndex...

> I notice that there is a hash opclass text_pattern_ops, which I'd
> actually never heard of until now, and I don't see documented. What
> would we need to do about that?

Huh. I wonder why that's there --- I can't see a reason why it'd
behave differently from the regular hash text_ops. Maybe we feared
that someday it would need to be different? Anyway, I think we can
just ignore it for this purpose.

> Would it help if one created COLLATE "C" indexes instead of
> text_pattern_ops? What are the tradeoffs between the two approaches?

Of course, the pattern_ops opclasses long predate our COLLATE support.
I suspect if we'd had COLLATE we never would have invented them.
I believe the pattern_ops are equivalent to a COLLATE "C" index, both
theoretically and in terms of the optimizations we know about making.
There might be some minor performance difference from not having to
look up the collation, but not much if we've done the collate-is-c
fast paths properly.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2019-09-17 15:38:28 Re: Define jsonpath functions as stable
Previous Message Pavel Stehule 2019-09-17 15:15:42 Re: patch: psql - enforce constant width of last column