Re: Hash join not finding which collation to use for string hashing

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Subject: Re: Hash join not finding which collation to use for string hashing
Date: 2020-01-30 20:50:21
Message-ID: 14129.1580417421@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I assume that what would have to happen to implement this is that an
> SQL-callable function would be passed more than one collation OID,
> perhaps one per argument or something like that. Notice, however, that
> this would require changing the way that functions get called. See the
> DirectFunctionCall{1,2,3,...}Coll() and
> FunctionCall{0,1,2,3,...}Coll() and the definition of
> FunctionCallInfoBaseData -- there's only one spot for an OID available
> right now. Allowing for more would likely have a noticeable impact on
> the cost of calling SQL-callable functions, and that's already
> expensive enough that people have been unhappy about it. It seems
> unlikely that it would be worth incurring more overhead here for every
> query all the time just to make this case work.
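
For reference, the single slot in question is the fncollation field of
FunctionCallInfoBaseData; slightly simplified from fmgr.h, the current
declaration looks about like this:

    typedef struct FunctionCallInfoBaseData
    {
        FmgrInfo   *flinfo;        /* lookup info used for this call */
        fmNodePtr   context;       /* info about context of call */
        fmNodePtr   resultinfo;    /* extra info about the result */
        Oid         fncollation;   /* the one collation OID we get to pass */
        bool        isnull;        /* function must set true if result is NULL */
        short       nargs;         /* # arguments actually passed */
        NullableDatum args[FLEXIBLE_ARRAY_MEMBER];
    } FunctionCallInfoBaseData;

    /* and this is all PG_GET_COLLATION() gives the function to work with */
    #define PG_GET_COLLATION()  (fcinfo->fncollation)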

The implementation I was visualizing was replacing, eg,
FuncExpr.inputcollid with an OID List, and then teaching PG_GET_COLLATION
to throw an error if the list is longer than one element. I agree that
the performance implications of that would be pretty troublesome, though.
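
As a very rough sketch of what that would mean at the fmgr level (the
fncollations field and the helper below are invented for illustration,
not an actual proposal):

    /*
     * Hypothetical: FuncExpr.inputcollid becomes a List of per-argument
     * input collation OIDs, carried down into the call info, and the
     * PG_GET_COLLATION() equivalent errors out when the inputs don't
     * collapse to a single collation.
     */
    static inline Oid
    pg_get_collation_checked(FunctionCallInfo fcinfo)
    {
        if (list_length(fcinfo->fncollations) > 1)   /* imagined field */
            ereport(ERROR,
                    (errcode(ERRCODE_INDETERMINATE_COLLATION),
                     errmsg("could not determine which collation to use")));
        return fcinfo->fncollations ?
            linitial_oid(fcinfo->fncollations) : InvalidOid;
    }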

In the end, it seems like the only solution that would be remotely
practical from a performance standpoint is to redefine things so that
collation-sensitive functions have to be labeled as such in pg_proc,
and then we can have the parser throw the appropriate error if it
can't resolve an input collation for such a function. Perhaps the
backwards-compatibility hit wouldn't be as bad as it first seems,
since the whole thing can be ignored for functions that haven't got at
least one collatable input, and most of those would likely be all right
with a default assumption that they are collation-sensitive. Or maybe
better, we could make the default assumption be that they aren't
sensitive, with the same error still being thrown at runtime if they are;
extensions would then have to take positive action to get the better
error behavior, but if they don't, things are no worse than today.
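
A parse-time version of that check could end up looking something like
this (the procollsensitive pg_proc column is invented for illustration,
and the surrounding variables are placeholders):

    /*
     * Hypothetical check in parse_collate.c: if pg_proc says the function
     * is collation-sensitive, it has at least one collatable input, and no
     * input collation could be resolved, complain at parse time instead of
     * leaving it to the function at runtime.
     */
    if (procform->procollsensitive &&
        has_collatable_input &&
        !OidIsValid(inputcollid))
        ereport(ERROR,
                (errcode(ERRCODE_INDETERMINATE_COLLATION),
                 errmsg("could not determine which collation to use for function %s",
                        NameStr(procform->proname)),
                 errhint("Use the COLLATE clause to set the collation explicitly.")));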

Mark, obviously, would then lobby for the pg_proc marking to
include one state that identifies functions that only care about
collation when it's nondeterministic. But I'm still not very
sure how that would work as soon as you look anyplace except at
what texteq() itself would do. The questions of whether such a
query matches a given index, or could be implemented via mergejoin,
etc., remain.

regards, tom lane
