Re: Hash join not finding which collation to use for string hashing

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-30 18:55:37
Message-ID: 1849CB5F-BEDB-4EFC-964F-B1E1499DA710@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jan 29, 2020, at 10:14 PM, Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
>
>
> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
> WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
> ERROR: could not determine which collation to use for string comparison
> HINT: Use the COLLATE clause to set the collation explicitly.
>
> With PG 11, I can see that hash join and nestloop join work. But with
> PG 12, this join can't possible work without an explicit COLLATE
> clause. So it would be nice if we can report a more specific error
> much sooner, possibly with some parser context, given that we now know
> for sure that a join qual without a collation assigned will not work
> at all. IOW, maybe we should aim for making the run-time collation
> errors to be of "won't happen" category as much as possible.
>
> Tom said:
>>>> Now, I'd be the first to agree that this error could be reported better.
>>>> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
>>>> what it does *not* know is whether the '=' operator cares for collation.
>>>> Throwing an error when the operator wouldn't care at runtime isn't going
>>>> to make many people happy. On the other hand, when the operator finally
>>>> does run and can't get a collation, all it knows is that it didn't get a
>>>> collation, not why. So we can't produce an error message as specific as
>>>> "ja_JP and tr_TR collations conflict".
>>>>
>>>> Now that the collations feature has settled in, it'd be nice to go back
>>>> and see if we can't improve that somehow. Not sure how.
>
> Would it make sense to catch a qual with unassigned collation
> somewhere in the planner, where the qual's operator family is
> estatblished, by checking if the operator family behavior is sensitive
> to collations?

Hi Amit, I appreciate your attention to my question, but I’m not ready to delve into possible fixes, as I still don’t entirely understand the problem.

According to Tom:

> (BTW, before v12 the text '=' operator indeed did not care for collation,
> so this example would've worked. But the change in behavior is a
> necessary consequence of having invented nondeterministic collations,
> not a bug.)

I’m still struggling with that, because the four collations I used in the example are all deterministic. I totally understand why having more than one collation matters if you ask that your data be in sorted order, as the system needs to know which ordering to use. But for equality, I would think that deterministic collations are all interchangeable, because they all agree on whether A = B, regardless of the collation defined on column A and/or on column B. Maybe I’m wrong about that. But that’s my reading of the definition of “deterministic collation” given in the docs:

> A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence.

I’m reading that as “If and only if”, and maybe I’m wrong to do so. Maybe that’s my error. But assuming that part is ok, it would seem to be sufficient to know that the columns being joined use deterministic collations, and you wouldn’t need them to be the *same* collations, nor even remember which collations they were. You’d just need information passed down that collations can be ignored for this comparison, or that a built-in byte-for-byte equality comparator should be used rather than the collation’s equality comparator, or some such solution.

I’m guessing I’m wrong about at least one of these things, and I’m hoping somebody enlightens me.

Thanks so much in advance,


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-01-30 19:05:14 Re: Duplicated LSN in ReorderBuffer
Previous Message Tom Lane 2020-01-30 18:54:31 Brokenness in dump/restore for GENERATED expressions