Re: Hash join not finding which collation to use for string hashing

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-29 04:03:35
Message-ID: 11AD8FE2-0E91-4865-9006-7A996BA74E67@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
>
>
>> On Jan 28, 2020, at 6:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
>>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.
>>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.
>>
>> Hm, I don't see any bug here. You're asking it to join
>>
>>>> CREATE TABLE alpha (a TEXT COLLATE "ja_JP", b TEXT COLLATE "sv_SE");
>>>> CREATE TABLE beta (a TEXT COLLATE "tr_TR", b TEXT COLLATE "en_US");
>>
>>>> SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
>>
>> so t1.a and t2.a have different collations, and the system can't resolve
>> which to use for the comparison.
>>
>> Now, I'd be the first to agree that this error could be reported better.
>> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
>> what it does *not* know is whether the '=' operator cares for collation.
>> Throwing an error when the operator wouldn't care at runtime isn't going
>> to make many people happy. On the other hand, when the operator finally
>> does run and can't get a collation, all it knows is that it didn't get a
>> collation, not why. So we can't produce an error message as specific as
>> "ja_JP and tr_TR collations conflict".
>>
>> Now that the collations feature has settled in, it'd be nice to go back
>> and see if we can't improve that somehow. Not sure how.
>>
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked. But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)
>
> I contemplated that for a while before submitting the report. I agree that for strings that are not binary equal, some collations might say the two strings are equal, and other collations may say that they are not. But when does any collation say that a string is not equal to itself? All the strings in these columns were loaded from the same source table, and they should always equal themselves, so the only problem I am aware of is if some of them equal others of them under one of the collations in question, where the other collation doesn’t think so. I’m pretty sure that does not exist in this concrete example.
>
> I guess I’m arguing that the system is giving up too soon, saying, “In theory there might be values I don’t know how to compare, so I’m going to give up now and not look”.
>
> I think what is happening here is that the system thinks, “Hey, I can use a hash join for this”, and then later realizes, “Oh, no, I can’t” and instead of falling back to something other than hash join, it gives up.
>
> Is there some more fundamental reason this query couldn’t correctly be completed? I don’t mind being enlightened about the part that I’m missing.

If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than having it sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand that. If so, then let’s put this error string into the docs, because right now, if you google

site:postgresql.org "could not determine which collation to use for string hashing”

you don’t get anything from the docs telling you that this is an expected outcome.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kohei KaiGai 2020-01-29 04:16:30 Re: Is custom MemoryContext prohibited?
Previous Message Mark Dilger 2020-01-29 03:38:59 Re: Hash join not finding which collation to use for string hashing