Re: Hash join not finding which collation to use for string hashing

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-30 20:29:37
Message-ID: 13291.1580416177@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> Would it make sense to catch a qual with unassigned collation
> somewhere in the planner, where the qual's operator family is
> estatblished, by checking if the operator family behavior is sensitive
> to collations?

> I’m not sure how to do that. pg_opfamily doesn’t seem to have a field for that. Can you recommend how I would proceed there?

There's no such information attached to opfamilies, which is more or less
forced by the fact that individual operators don't expose it either.
There's not much hope of making that better without incompatible changes
in the requirements for extensions to define operators and/or operator
families.

> So, for = and !=, I’m looking at the definition of texteq, and it calls check_collation_set as one of the very first things it does. That’s where the error that annoys me comes out. But I don’t think it really needs to be doing this. It should first be determining if collation *matters*.

But of course it matters. How do you know whether the operation is
deterministic if you don't know the collation?

> 3) Extend the concept of collations to collation sets. Right now, I’m only thinking about a collation set as having two values, the lefthand and the righthand side, but maybe there are other cases like (Left, (Left,Right)) that get built up and need to work. Anyway, at the point in the executor that the collations don’t match, instead of passing NULL down the line, pass in a collation set (Left, Right), and functions like texteq can see that they’re dealing with two different collations and decide if they can deal with that or if they need to throw an error.

Maybe this could work. I think it would get messy when bubbling up
collations, but as long as you're talking about "sets" not "pairs"
it might be possible to postpone collation resolution.

To me, though, the main advantage of this is that we could throw a
more explicit error like "collations "ja_JP" and "tr_TR" cannot be
unified", since that information would still be there at runtime.
I'm still pretty dubious that having texteq special-case the situation
where the collations are different but all deterministic is a reasonable
thing to do.

One practical problem is that postponing that work to runtime could be
a huge performance hit, because you'd have to do it over again on each
call of the operator. I suppose some caching might be possible.

Another issue is that you're still putting far too much emphasis on
the fact that a hash-join plan manages to avoid this error, and ignoring
the problem that a lot of other plans for the same query will not avoid
it. What if the planner had chosen a merge-join, for instance? How
useful is it to allow the join if things still break the moment you
add an ORDER BY?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2020-01-30 20:38:22 Re: BufFileRead() error signalling
Previous Message Mark Dilger 2020-01-30 20:18:05 Re: Hash join not finding which collation to use for string hashing