Re: Hash join not finding which collation to use for string hashing

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-30 20:02:41
Message-ID: 12204.1580414561@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> According to Tom:
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked. But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)

> I’m still struggling with that, because the four collations I used in
> the example are all deterministic. I totally understand why having more
> than one collation matters if you ask that your data be in sorted order,
> as the system needs to know which ordering to use. But for equality, I
> would think that deterministic collations are all interchangeable,
> because they all agree on whether A = B, regardless of the collation
> defined on column A and/or on column B. Maybe I’m wrong about that.

Well, you're not wrong, but you're assuming much finer distinctions
than the collation machinery actually makes (or than it'd be sane
to ask it to make, IMO). We don't have a way to tell texteq that
"well, we don't know what collation to assign to this operation,
but it's okay to assume that it's deterministic". Nor does the
parser have any way to know that texteq could be satisfied by
that knowledge --- if it doesn't even know whether texteq cares
about collation, how could it know that?

There are other issues here too. Just because the query could
theoretically be implemented without reference to any specific
collation doesn't mean that that's a practical thing to do.
It'd be unclear for instance whether we can safely use indexes
that *do* have specific collations attached. We'd also lose
the option to consider plans like mergejoins.

If the parser understood that a particular operator behaved
like text equality --- which it does not, and I guarantee you
I will shoot down any proposal to hard-wire a parser test for
that particular operator --- you could imagine assigning "C"
collation when we have an unresolvable combination of
deterministic collations for the inputs. That dodges the
problem of not having any way to represent the situation.
But it's still got implementation issues, in that such a
collation choice probably won't match the collations of any
indexes for the input columns.

Another issue is that collations "bubble up" in the parse tree,
so sneaking in a collation that's not supposed to be there per
spec carries a risk of causing unexpected semantics further up.
I think we could get away with that for the particular case of
equality (which returns collation-less boolean), but this is
another thing that makes the case narrower and less useful.

In the end, TBH, I'm not finding your example compelling enough
to be worth putting in weird hacks for such cases. If you're
joining columns of dissimilar collations, you're going to be
finding it necessary to specify what collation to use in a lot
of places ... so where's the value in making a weird special
case for equality?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dan Katz 2020-01-30 20:09:57 Re: ERROR: subtransaction logged without previous top-level txn record
Previous Message Mark Dilger 2020-01-30 19:43:46 Re: Hash join not finding which collation to use for string hashing