Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: James Lucas <jlucasdba(at)gmail(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
Date: 2020-05-28 00:21:26
Message-ID: 5075.1590625286@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

James Lucas <jlucasdba(at)gmail(dot)com> writes:
> explain select * from ctestnd where t like 'a%' collate "C";
> ERROR: nondeterministic collations are not supported for LIKE

Yeah. I traced through this, and the place where it's failing is where
the planner tries to apply the LIKE operator to the stored MCV values
(to see how many of them pass the condition, which gives us a big clue
about the selectivity). Unfortunately, per the comments in selfuncs.c,

* For both oprrest and oprjoin functions, the operator's input collation OID
* (if any) is passed using the standard fmgr mechanism, so that the estimator
* function can fetch it with PG_GET_COLLATION(). Note, however, that all
* statistics in pg_statistic are currently built using the relevant column's
* collation. Thus, in most cases where we are looking at statistics, we
* should ignore the operator collation and use the stats entry's collation.
* We expect that the error induced by doing this is usually not large enough
* to justify complicating matters. In any case, doing otherwise would yield
* entirely garbage results for ordered stats data such as histograms.

mcv_selectivity is following this advice and applying LIKE with the
ctestnd.t column's declared collation ... and then the operator throws
an error.

The idea that using the "wrong" collation might actually cause an error
was not factored into this design, obviously. I'm not sure offhand what
to do about it. If we go over to using the query's collation then we
avoid that issue, but instead we have the problem noted in this comment
about the histogram sort order not matching what the operator expects.
(In the case of mcv_selectivity the sort order isn't really an issue,
but it is an issue for sibling functions such as
ineq_histogram_selectivity.)

This issue only dates back to commit 5e0928005; before that, we just
blindly passed DEFAULT_COLLATION_OID to operators being evaluated for
estimation purposes. (I suppose if you made the database's default
collation nondeterministic, you could still get into trouble; but that
case may not be reachable right now.) On the other hand, the actual
breakage is even newer, because nondeterministic collations weren't
added until 5e1963fb7, several months later. Both of those are v12
cycle, so it's academic from a user's standpoint which one we blame;
but the upshot is that this case doesn't work.

Ideally, no operator would ever throw an error about unsupported
collations, but I suppose that day is far away.

I guess the path of least resistance is to change the selectivity
functions to use the query's collation; then, if you get an error
here you would have done so at runtime anyway. The problem of
inconsistency with the histogram collation will be real for
ineq_histogram_selectivity; but we had a variant of that before,
in that always using DEFAULT_COLLATION_OID would give answers
that were wrong for a query using a different collation.

Peter, any other thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-05-28 05:37:19 Re: BUG #16467: XX000 fun
Previous Message PG Bug reporting form 2020-05-27 22:09:09 BUG #16467: XX000 fun