Re: join removal

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: join removal
Date: 2009-07-17 02:27:08
Message-ID: 603c8f070907161927r74d930b3t671bda8786c8b41f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 16, 2009 at 9:02 PM, Greg Stark<stark(at)mit(dot)edu> wrote:
> I started looking at this patch and it looks pretty good as far as it
> goes. But I think we can do a lot more. It seems to me the cases where
> foreign key relationships exist are likely to be really big use cases.

I agree. But that seems a lot harder, and this is useful all by
itself because it can eliminate LEFT joins. Foreign key deductions
will be necessary to eliminate inner joins and self-joins. I've been
advised that when writing patches for PostgreSQL it's best to start
with something small. :-)

> I have one big worry though. Currently you're detecting the unique
> property using the planner's path mechanism. I suppose that works, but
> it's only an accident of the planner design that the path for the
> unique index will always be there if it's the join condition. My
> instinct is that this code should be going back to the raw index info
> to prove this property. The only practical effect I can think of is
> that the plan will have to be marked as being dependent on that index
> and that will be hard to do if you haven't identified a specific index
> you're basing it on.

I had trouble figuring out where to hook in the logic. In an ideal
world, it would be nice to detect that the join is removable earlier,
but it's hard to do that, because it's not until we know the join
order that we can test whether any attributes from the inner rel are
used above the level of the join. But as it is the fact that the join
can be removed will have to be rediscovered over and over again as
planning progresses.

As for going back to "the raw index info", that was kind of my
instinct as well but I couldn't make it work. It seems that the
IndexOptInfo structure only knows the column numbers of the index's
keys, whereas the code that considers possible join strategies has
only equivalence classes to work with, and I don't see how to match
the two up. If we can figure out a way to do that it would probably
be cleaner.

> I would like to see a list of cases we plan to tackle preferably with
> example queries, as a kind of checklist so we can knock them down one
> by one.  Right now it's unclear just how much of the problem space is
> being solved.

I don't know how many cases I personally plan to handle because I
don't know how much time I'm going to have to work on this or whether
I have the needed brainpower. But I can enumerate the cases that I
know about where this is theoretically possible.

- LEFT joins can be eliminated if the nullable side of the join can be
proved unique over the join columns. The simplest and most common
case is the one where there is a unique index on any (not necessarily
proper) subset of the join columns, but it can also happen in any
other case where we can prove that the inner rel is unique over (a
subset of) the relevant columns, such as when the inner rel groups by
those columns. There is an existing function query_is_distinct_for()
that does something along these lines, but it operates on yet another
different type of data structure (a Query, rather than a list of
equivalence classes or alternatively a list of varattnos) and doesn't
handle the unique-index case, which is probably the most important one
for this optimization.

- INNER joins are more complex because what happens on the inner side
of the join can potentially wipe out rows from the result. With a
LEFT join, it's sufficient to prove that the inner rel is at least
unique enough, but for an INNER join, we have to prove that it's
exactly UNIQUE enough. I think we can only provide this when the
inner rel is a base relation with a unique index over EXACTLY (not a
subset of) the relevant columns AND there is a foreign key
relationship from the outer rel to the inner rel over the join
columns.

- Self-joins (whether they are inner, left, semi, or full) can be
collapsed into a scan of the underlying base relation if the join
columns on both sides include all the columns of the same unique
index. All the quals from both sides have to be applied.

> Incidentally, guess what other database just got this feature committed...
>
> http://askmonty.org/worklog/Client-BackLog/?tid=17

Hmm, well, it would be nice to have parity. This is a hugely
important feature for the kinds of queries I do all day.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2009-07-17 02:43:35 Review: Revise parallel pg_restore's scheduling heuristic
Previous Message Robert Haas 2009-07-17 01:38:48 Re: [PATCH] SE-PgSQL/tiny rev.2193