Re: Performance improvement for joins where outer side is unique

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2017-01-26 23:39:09
Message-ID: 12115.1485473949@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

To re-familiarize myself with this patch, I've been re-reading the thread,
which has gotten quite long. It seemed like it would be a good idea to
stop and try to summarize what the patch ought to accomplish, because
there's been some drift over the more than 2 years the patch has been
in the works.

So: ISTM there are two core ideas at this point.

1. We should recognize when the inner side of a join is unique (that is,
there is provably no more than one inner tuple joining to any given outer
tuple), and make use of that knowledge to short-circuit execution in the
same way we already do for JOIN_SEMI and JOIN_ANTI cases. That is, once
we find a match we can immediately move on to the next outer tuple rather
than continuing to scan for inner-side matches.

2. In these same cases (unique/semi/anti joins), it is possible to avoid
mark/restore overhead in a mergejoin, because we can tweak the executor
logic to not require backing up the inner side. This goes further than
just tweaking the executor logic, though, because if we know we don't
need mark/restore then that actually makes some plan shapes legal that
weren't before: we don't need to interpose a Material node to protect
join inputs that can't mark/restore.

Maybe I missed something, but it doesn't look like the current patch
(unique_joins_2017-01-27_no_outer_unique.patch) has anything concerning
point #2 at all. It might make sense to address that idea as a follow-on
patch, but I think it can be a quite significant win and we shouldn't
just lose track of it.

Anyway, having laid out that scope of work, I have some concerns:

* The patch applies point #1 to only INNER and LEFT join modes, but
I don't really see why the idea wouldn't work for RIGHT and FULL modes,
ie the optimization seems potentially interesting for all executable join
types. Once you've got a match, you can immediately go to the next outer
tuple instead of continuing to scan inner. (Am I missing something?)

* Particularly in view of the preceding point, I'm not that happy with
the way that management/caching of the "is it unique" knowledge is
done completely differently for INNER and LEFT joins. I wonder if
there's actually a good argument for that or is it mostly a development
sequence artifact. IOW, would it hurt to drop the SpecialJoinInfo
tie-in and just rely on the generic cache?

* Because of where we apply the short-circuit logic in the executor,
it's only safe to consider the inner rel as unique if it is provably
unique using only the join clauses that drive the primary join mechanism
(ie, the "joinquals" not the "otherquals"). We already do ignore quals
that are pushed-down to an outer join, so that's good, but there's an
oversight: we will use a qual that is mergejoinable even if it's not
hashjoinable. That means we could get the wrong answers in a hash join.
I think probably the appropriate fix for the moment is just to consider
only clauses that are both mergeable and hashable while trying to prove
uniqueness. We do have some equality operators that support only one
or the other, but they're corner cases, and I'm dubious that it's worth
having to make separate proofs for merge and hash joins in order to
cater to those cases.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2017-01-26 23:52:45 Re: logical decoding of two-phase transactions
Previous Message Craig Ringer 2017-01-26 23:22:11 Re: BUG: pg_stat_statements query normalization issues with combined queries