Re: Performance improvement for joins where outer side is unique

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2016-04-02 10:26:35
Message-ID: CAKJS1f8naKkOueF0yRaTPzErzyLHjcOCwKvVnr5a4rv4doNzHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 2 April 2016 at 05:52, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
> > On 12 March 2016 at 11:43, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> It seems like the major intellectual complexity here is to figure out
> >> how to detect inner-side-unique at reasonable cost. I see that for
> >> LEFT joins you're caching that in the SpecialJoinInfos, which is probably
> >> fine. But for INNER joins it looks like you're just doing it over again
> >> for every candidate join, and that seems mighty expensive.
>
> > I have a rough idea for this, but I need to think of it a bit more to
> > make sure it's bulletproof.
>
> Where are we on this? I had left it alone for awhile because I supposed
> that you were going to post a new patch, but it's been a couple weeks...

Apologies for the delay. I was fully booked.

I worked on this today to try and get it into shape.

Changes:

* Added unique_rels and non_unique_rels cache to PlannerInfo. These
are both arrays of Lists which are indexed by the relid, which is
either proved to be unique by, or proved not to be unique by each
listed set of relations. Each List item is a Bitmapset containing the
relids of the relation which proved the uniqueness, or proved no
possibility of uniqueness for the non_unique_rels case. The
non_unique_rels cache seems far less useful than the unique one, as
with the standard join search, we start at level one, comparing
singleton relations and works our way up, so it's only towards the end
of the search that we'll have more rels on each side of the join
search. Many of the unique cases are found early on in the search, but
the non-unique cases must be rechecked as more relations are added to
the search, as that introduces a new possibility of uniqueness. In
fact, the only cache hit of the non-unique case in the regression
tests is a test that's using the GEQO, so I'm not quite sure if the
standard join search will ever have cache hit here. Never-the-less I
kept the cache, as it's likely going to be most useful to have it when
the GEQO is running the show, as that's when we're going to see the
largest number of relations to join.

There's also a small quirk which could lead to false negatives for
uniqueness which might still need ironed out: I don't yet attempt to
get the minimum set of relations which proved the uniqueness. This,
perhaps, does not matter much for the standard join search, but likely
will, more so for GEQO cases. Fixing this will require changes to
relation_has_unique_index_for() to get it to record and return the
relids of the clauses which matched the index.

* Renamed JOIN_LEFT_UNIQUE to JOIN_SEMI_LEFT. It seems better to
maintain the SEMI part. I thought about renaming JOIN_SEMI to
JOIN_SEMI_INNER, but thought I'd better not touch that here.

* Made a pass to update comments in the areas where I've added
handling for JOIN_SEMI_LEFT.

* I also updated comments in the regression tests to remove reference
to unique joins. "Join conversion" seems to be a better term now.

There was also a couple of places where I wasn't quite sure how to
handle JOIN_SEMI_LEFT. I've marked these with an XXX comment. Perhaps
how to handle them is more clear to you.

I also was not quite sure the best place to allocate memory for these
caches. I've ended up doing that in setup_simple_rel_arrays(), which
is likely the wrong spot. I originally did this in
standard_join_search(), after join_rel_level is allocated memory, but
I soon realised that it can't happen here as the GEQO requires the
caches too, and naturally it does not come through
standard_join_search().

I was not quite sure if I should update any docs to mention that Inner
Joins can become Semi Joins in some cases.

I was also a bit unsure if I should move the two new functions I added
to joinpath.c into analyzejoins.c.

Thanks for taking an interest in this.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
unique_joins_6f34aa1_2016-04-02.patch application/octet-stream 60.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-04-02 11:55:50 Re: Speed up Clog Access by increasing CLOG buffers
Previous Message Michael Paquier 2016-04-02 08:03:22 Re: fd.c: flush data problems on osx