join removal

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: join removal
Date: 2009-01-19 03:56:21
Message-ID: 603c8f070901181956j9541113xe68df5985d558a97@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Apologies for submitting new patches while we're still in the midst of
CommitFest:November, but I think I've more or less come to the end of
the reviewing that I can usefully do for 8.4 (or at least, I haven't
had any requests lately, feel free...) and I don't want to wait
forever to start thinking about 8.5.

This patch is based on Simon Riggs' earlier work on join removal,
although I don't believe I've actually used any of his code. Tom
Lane's review comments on Simon's code were invaluable in getting this
to work. I also understand that Simon plans to do more work in this
area, so if this ends up getting sucked into his work or visca versa
(or this ends up getting superseded altogether), that is fine.

http://archives.postgresql.org/pgsql-patches/2008-08/msg00035.php
http://archives.postgresql.org/pgsql-patches/2008-09/msg00024.php

The theoretical underpinning of this patch is as follows: Joins can
affect the result of a query in one of two ways: either they change
the set of rows returned (by either increasing it or decreasing it),
or they provide attributes. To remove a join, we need to prove that
neither of these things is possible. It's pretty easy to test that
the join isn't providing any attributes above the level of the join,
and you'll see that rel_attrs_needed_above_join() handles that here.
Verifying that the number of rows returned isn't changed by the join
is harder.

For the sake of simplicity, this patch only attempts to handle LEFT
JOINs. Specifically, it attempts to prove that the results of
scanning the outer side of the joinrel will be identical to the
results of performing a merge-join, up to sorting. Nothing that
happens on the inner side of the join can reduce the number of output
rows (except via a WHERE-clause qual that will fail the
rel_attrs_needed_above_join test anyway), so the only thing we need to
worry about is the possibility that the join might increase the output
row count. To do that, we need to know that there will be a
sufficiently strong unique-ness property on the rows emitted by the
inner side of the join. I believe that there are two main cases in
which we could potentially prove this: the inner side of the join is
an index scan (without resort) on a unique index, or the inner side of
the join is a subselect with a group by clause over the join columns.
For now, I'm only attempting to handle the first case, though it might
be nice to eventually handle the other as well.

Simon's original patch starts by checking whether the RelOptInfo for
the relation to be dropped is a base relation, but that approach seems
somewhat limiting, because we certainly want to be able to remove
multiple joins in succession. Setting the paths for the joinrel {B C}
to the paths for B doesn't help us when we try to form {A B C},
because now the join of {A} to {B C} is not a join against a base rel
and join removal fails. The other problem is that at that level we
don't really understand what's up with the join clauses: is the join
operator even an equality operator? There is finally the problem that
while we can get at the relevant IndexOptInfo structures and figure
out which combinations of attributes have unique indices, I can't
figure out any sane way to relate those attribute numbers back to the
join clauses.

So I've taken the alternative approach of working on the level of
paths. sort_outer_and_inner(), before forming a join path, checks
whether there happens to be a unique index with the same pathkeys as
the inner side of the path. If so, and if no attributes are needed
from the inner side of the join, the it pulls up all the paths from
the outerrel into the joinrel and exits (really, it should skip the
rest of add_paths_to_joinrel() as well, but it didn't seem worth
refactoring that without some validation of the basic approach). This
approach is limiting because sort_outer_and_inner() doesn't try every
possible combination of pathkeys that might be helpful to us, but it
seems likely to be adequate here for the same reasons that
sort_outer_and_inner() gets away with it generally: the list of
mergeclauses tends to be real short.

(I think that if we were trying to optimize away an INNER join, this
approach would be inadequate, because any non-merge-joinable join
clauses could change the number of output rows. Here that can't
happen: the most they can do is force the inner side to NULL, but if
the inner side isn't being used that doesn't matter anyway. Of
course, for INNER JOINs, we'd also need to worry about the
merge-joinable clauses themselves stripping rows from the output due
to a failure to match; we'd need to check for foreign key
constraints.)

I have a feeling there are probably lingering bugs in here, but it
passes regression tests and some sample queries I tried still return
the correct answers even after nuking one or more joins.

Any comments appreciated,

...Robert

Attachment Content-Type Size
join-removal-v1.patch text/x-diff 11.1 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2009-01-19 06:11:16 Re: Review: B-Tree emulation for GIN
Previous Message ITAGAKI Takahiro 2009-01-19 01:28:12 Re: Visibility map and freezing