Re: [HACKERS] PoC: full merge join on comparison clause

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: [HACKERS] PoC: full merge join on comparison clause
Date: 2018-11-19 01:46:40
Message-ID: 22877.1542592000@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru> writes:
> [ Inequality-merge-join-v10.patch ]

Just thinking about this patch a bit ... I wonder why you were so quick to
reject the UNION approach at the outset. This patch is pretty messy, and
it complicates a lot of stuff that is quite fundamental to the planner,
and you still end up that the only functionality gain is now we can handle
full joins whose conditions include a single btree inequality clause.
Nor are we doing that remarkably efficiently ... it's pretty much
impossible to do it efficiently, in fact, since if the inputs have M and N
rows respectively then the output will have something like (M*N)/2 rows.

So it seems to me that if we're going to put sweat into this area at all,
our ambition ought to be "we'll successfully perform a FULL JOIN with any
join clause whatsoever, though it might take O(M*N) time".

Now as far as I can tell, the UNION substitution you proposed is
completely valid, although it'd be better to phrase the second step
as an antijoin. That is, I believe

select * from t1 full join t2 on (anything)

is exactly equal to

select t1.*, t2.* from t1 left join t2 on (anything)
union all
select t1.*, t2.* from t2 anti join t1 on (anything)

There is one fly in the ointment, which is that we will have to run the
join clause twice, so it can't contain volatile functions --- but the
merge join approach wouldn't handle that case either.

Having to read the inputs twice is not good, but we could put them
into CTEs, which fixes any problems with volatility below the join
and at least alleviates the performance problem. Since we can't
currently do any meaningful qual pushdown through full joins, the
optimization-fence aspect of a CTE doesn't seem like an issue either.

In short, proceeding like the above when we can't find another plan
type for a full join seems like it fixes a far wider variety of cases.
The possibility that maybe we could do some of those cases a bit faster
isn't sufficiently attractive to me to justify also putting in a
mechanism like this patch proposes. We only rarely see complaints at
all about can't-do-a-full-join problems, and I do not think this patch
would fix enough of those complaints to be worthwhile.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-11-19 01:55:11 Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
Previous Message Michael Paquier 2018-11-19 01:27:29 Re: [PATCH] XLogReadRecord returns pointer to currently read page