Re: updated join removal patch

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: updated join removal patch
Date: 2009-09-17 21:14:23
Message-ID: 603c8f070909171414p5135aa58q186397232ad341e1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 17, 2009 at 4:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Sep 15, 2009 at 9:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>>> Here we go again.  Following Tom's advice to not insert crocks in
>>>> add_path(), I've instead introduced a noopjoin path type which ignores
>>>> its inner side.
>
> I've committed this with some revisions.  As to the points discussed
> earlier:

Thanks, it looks nice. I wonder if we shouldn't add some documentation
or regression tests, though - any thoughts?

>>> * I'm not really happy with the "NoopJoinPath" representation. ...
>>> A possible compromise is to use T_SubqueryScan as the pathtype, with
>>> the idea that we're pretending a SubqueryScan is to be inserted and then
>>> immediately optimized away.  But I don't want the inner path in there
>>> in any case.
>
>> I don't have strong feelings about it.  I thought about making just a
>> Noop path type, but I couldn't see any clear reason to prefer it one
>> way or the other.
>
> I ended up using a NoOpPath struct with pathtype = T_Join, which is a
> dummy superclass nodetag that never really gets instantiated.  The
> SubqueryScan idea looked too messy after I remembered that createplan.c
> likes to switch() on the pathtype, so having two different pathtype
> uses of T_SubqueryScan seemed pretty ugly.  But T_Join isn't really used
> anywhere, despite being nominally a executable plan type.  It's pretty
> much a judgment call whether that's really cleaner than using the path's
> own NodeTag...

Yeah. I haven't been able to wrap my head around why it's good to
have two tags on some of these things.

>>> * I'm quite unimpressed with the refactoring you did to make a single
>>> function deal with merge clauses, hash clauses, *and* join removal
>>> clauses.  I think that makes it close to unmaintainable and is buying
>>> little if anything speedwise.
>
>> You're the committer; I'm not.  But I completely disagree.  There
>> isn't any reason at all to duplicate this logic in two separate
>> places, let alone three.  I'd actually be in favor of merging the
>> existing two cases even if we weren't adding join removal.
>
> No, I still think this was a bad idea.  There are *parts* of those
> tests that are similar, but combining them all into one function is
> just a recipe for bugs.

Having read your commit, it makes more sense to me. The fact that
we're now looking at innerrel->baserestrictinfo also is a pretty
powerful argument for your way.

>> As for a GUC, I think it would be useful to have for debugging, or in
>> case of bugs.  It's really another join strategy, and we have enable_*
>> flags for all the others.  But if you don't want it for some reason,
>> I'm not in a position to twist your arm.
>
> I left it out for the moment.  If anyone can point to a case where the
> join removal logic slows planning noticeably without gaining anything,
> we can reconsider.

Well, the purpose of such flags is for debugging, not production. The
only argument I can see for not having one for join removal versus
nestloop, mergejoin, etc. is that it's pretty easy to kill removal of
a particular join by tweaking the output list, whereas it is not
comparably easy to get the planner to pick a different join type
otherwise.

>>> * Not entirely sure where to put the code that does the hard work
>>> (relation_is_distinct_for).
>
>> Yeah, it seemed a little weird to me.  For a while I was reusing some
>> of the support functions that query_is_distinct_for() calls, but the
>> final version doesn't.  I wonder if it should just be moved to
>> joinpath.c; it seems to fit in better with what is going on there.
>
> I ended up putting the index-specific logic in indxpath.c, which aside
> from seeming reasonable made it easier to deal with expression indexes.
> If we ever add another test method that doesn't depend on indexes, we
> can put it in a reasonable place and have joinpath.c call that too.

Sounds fine to me. I think that the next project in this area should
be to try to support removal of INNER joins. (Removal of SEMI, ANTI,
and FULL joins seems unlikely ever to be interesting.) That will
require teaching the planner about foreign key constraints, which
interestingly opens up some other interesting optimization
opportunities. An INNER join that can never match zero rows can
alternatively be implemented as a LEFT join, which can be reordered in
some situations where an inner join can't. e.g. A LJ (B IJ C ON Pbc)
ON Pab can be implemented as (A LJ B ON Pab) LJ/IJ C ON Pbc if it is
the case that for every row in B, there will be at least one row in C
such that Pbc holds.

Thanks again for fixing this up, and committing it. I have a TON of
queries that will benefit from this, and it's one of the reasons why I
started reading this mailing list on a regular basis and getting
involved in development. For me, it's worth an upgrade by itself.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2009-09-17 21:22:17 Re: [COMMITTERS] pgsql: CVS NULL Documentation Clearify documentation of CVS's output of
Previous Message Bruce Momjian 2009-09-17 21:13:36 Re: Fwd: Copy out wording