Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: James Lucas <jlucasdba(at)gmail(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Explicit deterministic COLLATE fails with pattern matching operations on column with non-deterministic collation
Date: 2020-06-02 22:53:48
Message-ID: 666679.1591138428@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I wrote:
> I guess the path of least resistance is to change the selectivity
> functions to use the query's collation; then, if you get an error
> here you would have done so at runtime anyway. The problem of
> inconsistency with the histogram collation will be real for
> ineq_histogram_selectivity; but we had a variant of that before,
> in that always using DEFAULT_COLLATION_OID would give answers
> that were wrong for a query using a different collation.

I worked on this for awhile and came up with the attached patchset.

0001 does about the minimum required to avoid this failure, by
passing the query's collation not stacoll to operators and selectivity
functions invoked during selectivity estimation. Unfortunately, it
doesn't seem like we could sanely back-patch this, because it requires
adding parameters to several globally-visible functions. The odds
that some external code is calling those functions seem too high to
risk an ABI break. So, while I'd like to squeeze this into v13,
we still need to think about what to do for v12.

0002 addresses the mentioned problem with ineq_histogram_selectivity
by having that function actually verify that the query operator and
collation match what the pg_statistic histogram was generated with.
If they don't match, all is not lost. What we can do is just
sequentially apply the query's operator and comparison constant to
each histogram entry, and take the fraction of matches as our
selectivity estimate. This is more or less the same insight we have
used in generic_restriction_selectivity: the histogram is a pretty
decent sample of the column, even if its ordering is not quite what
you want.

0002 also deletes a hack I had put in get_attstatsslot() to insert a
dummy value into sslot->stacoll. That hack isn't necessary any longer
(because indeed we aren't using sslot->stacoll's value anywhere as of
0001), and it breaks the verification check that 0002 wants to add to
ineq_histogram_selectivity, which depends on stacoll being truthful.
I also adjusted get_variable_range() to deal with collations more
honestly.

When I went to test 0002, I found out that it broke some test cases
in privileges.sql, and the reason was rather interesting. What those
cases are relying on is getting a highly accurate selectivity
estimate for a user-defined operator, for which the only thing the
planner knows for sure is that it uses scalarltsel as the restriction
estimator. Despite this lack of knowledge, the existing code just
blithely uses the histogram as though it is *precisely* applicable
to the user-defined operator. (Which it is, since that operator is
just a wrapper around regular "<" ... but the system has no business
assuming that.) So with the patch, the case exercises the new code
path that just counts matches, and that gives us only
1/default_statistics_target resolution in the selectivity estimate;
which is not enough to get the expected plan to be selected. I worked
around this for the moment by cranking up default_statistics_target
while running the ANALYZE in that test script, but I wonder if we
should instead tweak those test cases to be more robust.

I think the combination of 0001+0002 really moves the goalposts a
long way in terms of having honest stats estimation for non-default
collations, so I'd like to sneak it into v13. As for v12, about
the only alternatives I can think of are:

1. Do nothing, reasoning that if nobody noticed for a year, this
situation is enough of a corner case that we can leave it unfixed.
Obviously that's pretty unsatisfying.

2. Change all the stats functions to pass DEFAULT_COLLATION_OID
when invoking operator functions. This is not too attractive
either because it essentially reverts 5e0928005; in fact, to avoid
breaking things completely we'd likely have to revert the part
of that commit that taught ANALYZE to collect stats using column
collations instead of DEFAULT_COLLATION_OID. Then we get into
questions like what about 6b0faf723 --- it's going to be a mess.

3. Hack things up so that the core code renames all these exposed
functions to, say, ineq_histogram_selectivity_ext() and so on,
allowing the additional arguments to exist, but the old names would
still be there as ABI compatibility wrappers. This might produce
slightly funny results for external code calling the wrappers, since
the wrappers would have to assume DEFAULT_COLLATION_OID, but it'd
avoid an ABI break at least. I don't want to propagate such a thing
into HEAD, so this would leave us with unsightly differences between
v12 and earlier/later branches -- but there aren't *that* many places
involved. (I'd envision this approach as back-porting 0001 but not
0002. For one reason, there's noplace for a wrapper to get the
additional operator OID needed for ineq_histogram_selectivity_ext.
For another, the results for the privilege test suggest that 0002
might have surprising effects on user-defined operators, so back
patching it might draw more complaints.)

Alternatives #2 and #3 would result in (different) changes in the
selectivity estimates v12 produces when considering columns with
non-default collations and/or queries using collations that don't
match the relevant columns. So that might be an argument for
doing nothing in v12; people tend not to like it when minor
releases cause unexpected plan changes. Also, #2 is probably
strictly worse than #3 on this score, since it'd move such
estimates away from reality not towards it.

Thoughts?

regards, tom lane

Attachment Content-Type Size
0001-use-query-collation-in-selectivity-estimation.patch text/x-diff 28.0 KB
0002-handle-other-collations-in-scalarineqsel.patch text/x-diff 14.6 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2020-06-02 23:13:02 Re: Potential G2-item cycles under serializable isolation
Previous Message Thomas Munro 2020-06-02 22:50:07 Re: FailedAssertion("!OidIsValid(def->collOid)", File: "view.c", Line: 89)