Re: Hash join not finding which collation to use for string hashing

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-30 06:14:31
Message-ID: CA+HiwqGzOMK9_UEJ_Fu=cKnMOi1CaLfznEqHaG_xHVgO7aaWFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Mark,

On Wed, Jan 29, 2020 at 1:03 PM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> > On Jan 28, 2020, at 7:38 PM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> >> Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> >>> While reviewing the partition-wise join patch, I ran into an issue that exists in master, so rather than responding to that patch, I’m starting this new thread.
> >>> I noticed that this seems similar to the problem that was supposed to have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread. As such, I’ve included Tom and Amit in the CC list.

Just to clarify, we only intended in the quoted thread to plug
relevant holes of the *partitioning* code, which IIRC was more
straightforward to do than appears to be the case here.

> If the answer here is just that you’d rather it always fail at planning time because that’s more deterministic than having it sometimes succeed and sometimes fail at runtime depending on which data has been loaded, ok, I can understand that. If so, then let’s put this error string into the docs, because right now, if you google
>
> site:postgresql.org "could not determine which collation to use for string hashing”
>
> you don’t get anything from the docs telling you that this is an expected outcome.

You may have noticed that it's not only hash join that bails out:

EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
QUERY PLAN
------------------------------------------------------------
Hash Join
Hash Cond: ((t2.a)::text = (t1.a)::text)
-> Seq Scan on beta t2
-> Hash
-> Seq Scan on alpha t1
Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string hashing
HINT: Use the COLLATE clause to set the collation explicitly.

SET enable_hashjoin TO off;
-- error occurs partway through ExecInitMergeJoin(), so EXPLAIN can't finish
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

SET enable_mergejoin TO off;
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
QUERY PLAN
------------------------------------------------------------
Nested Loop
Join Filter: ((t1.a)::text = (t2.a)::text)
-> Seq Scan on beta t2
-> Materialize
-> Seq Scan on alpha t1
Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

With PG 11, I can see that hash join and nestloop join work. But with
PG 12, this join can't possible work without an explicit COLLATE
clause. So it would be nice if we can report a more specific error
much sooner, possibly with some parser context, given that we now know
for sure that a join qual without a collation assigned will not work
at all. IOW, maybe we should aim for making the run-time collation
errors to be of "won't happen" category as much as possible.

Tom said:
> >> Now, I'd be the first to agree that this error could be reported better.
> >> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
> >> what it does *not* know is whether the '=' operator cares for collation.
> >> Throwing an error when the operator wouldn't care at runtime isn't going
> >> to make many people happy. On the other hand, when the operator finally
> >> does run and can't get a collation, all it knows is that it didn't get a
> >> collation, not why. So we can't produce an error message as specific as
> >> "ja_JP and tr_TR collations conflict".
> >>
> >> Now that the collations feature has settled in, it'd be nice to go back
> >> and see if we can't improve that somehow. Not sure how.

Would it make sense to catch a qual with unassigned collation
somewhere in the planner, where the qual's operator family is
estatblished, by checking if the operator family behavior is sensitive
to collations?

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ian Barwick 2020-01-30 06:38:54 Re: Prevent pg_basebackup running as root
Previous Message Andrew Dunstan 2020-01-30 06:10:44 Re: making the backend's json parser work in frontend code