Re: Performance improvement for joins where outer side is unique

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2017-01-31 00:10:02
Message-ID: CAKJS1f8VF4aAASK51qmNjDXnahoJTUHqPes6EfJiA+KthmcqJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 28 January 2017 at 05:44, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
>>> hmm. I'm having trouble understanding why this is a problem for Unique
>>> joins, but not for join removal?
>
>> Ah, you know what, that's just mistaken. I was thinking that we
>> short-circuited the join on the strength of the hash (or merge) quals
>> only, but actually we check all the joinquals first. As long as the
>> uniqueness proof uses only joinquals and not conditions that will end up
>> as otherquals, it's fine.
>
> Actually, after thinking about that some more, it seems to me that there
> is a performance (not correctness) issue here: suppose that we have
> something like
>
> select ... from t1 left join t2 on t1.x = t2.x and t1.y < t2.y
>
> If there's a unique index on t2.x, we'll be able to mark the join
> inner-unique. However, short-circuiting would only occur after
> finding a row that passes both joinquals. If the y condition is
> true for only a few rows, this would pretty nearly disable the
> optimization. Ideally we would short-circuit after testing the x
> condition only, but there's no provision for that.

I've attached a patch which implements this, though only for
MergeJoin, else I'd imagine we'd also need to ensure all proofs used
for testing the uniqueness were also hash-able too. I added some XXX
comments in analyzejoin.c around the mergeopfamilies == NIL tests to
mention that Merge Join depends on all the unique proof quals having
mergeopfamilies. This also assumes we'll never use some subset of
mergejoin-able quals for a merge join, which could be an interesting
area in the future, as we might have some btree index on a subset of
those columns to provide pre-sorted input. In short, it's a great
optimisation, but I'm a little scared we might break it one day.

Implementing this meant removing the match_first_row_only being set
for JOIN_SEMI, as this optimisation only applies to unique_inner and
not JOIN_SEMI alone. This also means we should be checking for unique
properties on JOIN_SEMI now too, which I've enabled.

Also, I've removed all jointype checks from innerrel_is_unique(). I
took a bit of time to think about JOIN_UNIQUE_OUTER and
JOIN_UNIQUE_INNER. I think these are fine too, in fact one of the
regression test plans moved away from using a Semi Join due to proving
that the inner side was unique. That could perhaps allow a better
plan, since the join order can be swapped. I'd really like someone
else to have a think about that too, just to make sure I've not
blundered that.

I've put the join removal code back to the way it was before. As I
mentioned, we can't use the caches here since we're also using
additional quals for proofs in this case.

I wasn't sure if I should add some regression tests which exercises
MergeJoin a bit to test the new optimisation. Any thoughts on that?

David

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
unique_joins_2017-01-31.patch application/octet-stream 86.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-01-31 00:13:40 Re: Improvements in psql hooks for variables
Previous Message Tom Lane 2017-01-30 23:54:50 Re: Query fails when SRFs are part of FROM clause (Commit id: 69f4b9c85f)