Re: Patch to support SEMI and ANTI join removal

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: Patch to support SEMI and ANTI join removal
Date: 2014-10-06 09:57:25
Message-ID: CAApHDvqrW2QjhLGyp5YQzk36Hb+VjhBWqtPbVE=KP5peBXDSOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 1, 2014 at 1:34 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:

> On 2014-10-01 01:03:35 +1300, David Rowley wrote:
> > On Wed, Oct 1, 2014 at 12:01 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
> > wrote:
> >
> > > On 2014-09-30 23:25:45 +1300, David Rowley wrote:
> > > >
> > > > I've not quite gotten my head around how we might stop the unneeded
> > > > relation from being the primary path to join the other inner
> relations,
> > > > i.e. what would stop the planner making a plan that hashed all the
> other
> > > > relations and planned to perform a sequence scan on the relation
> that we
> > > > have no need to scan (because the foreign key tells us the join is
> > > > pointless). If we were not use use that relation then we'd just have
> a
> > > > bunch of hash tables with no path to join them up. If we did
> anything to
> > > > force the planner into creating a plan that would suit skipping
> > > relations,
> > > > then we could possibly be throwing away the optimal plan..... Right?
> > >
> > > I'm not 100% sure I understand your problem description, but let me
> > > describe how I think this would work. During planning, you'd emit the
> > > exactly same plan as you'd today, with two exceptions:
> > > a) When costing a node where one side of a join is very likely to be
> > > removable, you'd cost it nearly as if there wasn't a join.
> > >
> >
> > Ok given the tables:
> > create table t1 (x int primary key);
> > create table t2 (y int primary key);
> >
> > suppose the planner came up with something like:
> >
> > test=# explain (costs off) select t2.* from t1 inner join t2 on
> t1.x=t2.y;
> > QUERY PLAN
> > ----------------------------
> > Hash Join
> > Hash Cond: (t1.x = t2.y)
> > -> Seq Scan on t1
> > -> Hash
> > -> Seq Scan on t2
> >
> > If we had a foreign key...
> >
> > alter table t2 add constraint t2_y_fkey foreign key (y) references t1
> (x);
> >
> > ...the join to t1 could possibly be "ignored" by the executor... but
> > there's a problem as the plan states we're going to seqscan then hash
> that
> > relation, then seqscan t1 with a hash lookup on each of t1's rows. In
> this
> > case how would the executor skip the scan on t1? I can see how this might
> > work if it was t2 that we were removing, as we'd just skip the hash
> lookup
> > part in the hash join node.
>
> Hm, right. But that doesn't seem like a fatal problem to me. The planner
> knows about t1/t2 and Seq(t1), Seq(t2), not just Hash(Seq(t2)). So it
> can tell the HashJoin node that when the 'shortcut' qualifier is true,
> it should source everything from Seq(t2). Since the sequence scan
> doesn't care about the node ontop that doesn't seem to be overly
> dramatic?
> Obviously reality makes this a bit more complicated...
>
>
Ok, after a bit of study on the hash join code I can see what you mean, it
shouldn't really matter.
I've started working on this now and I've made some changes in
analyzejoins.c so that instead of removing joins, it just marks the
RangeTblEntry, setting a new flag named skipJoinPossible to true. (I'll
think of a better name later)

I'm starting off with hash joins and I'm hacking a bit at ExecInitHashJoin.
I want to add 2 bool flags to HashJoinState, outerIsRequired and
innerIsRequired. I think if both of these flags are set then we can just
abort the join altogether. Though in order to set these flags I need to
identify which relations are for the outer and which are for the inner side
of the join. I've got the logic for that only partially worked out. My
understanding of it so far is that I just need to look at
the hjstate->js.ps.lefttree and righttree. Inner being right, and outer
being left. I'm a little stuck on more complex cases where the scan is
nested deeper in the tree and I'm not quite sure on the logic I should be
using to navigate to it.

Take the following plan: (which I've amended to mark the left and right
nodes)

explain select t1.* from t1 inner join t2 on t1.t2_id= t2.id inner join t3
on t2.t3_id=t3.id;
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=122.15..212.40 rows=2140 width=8)
Hash Cond: (t2.t3_id = t3.id)
-> Hash Join (cost=58.15..118.98 rows=2140 width=12) (left node)
Hash Cond: (t1.t2_id = t2.id)
-> Seq Scan on t1 (cost=0.00..31.40 rows=2140 width=8) (left
node)
-> Hash (cost=31.40..31.40 rows=2140 width=8) (right node)
-> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=8)
(left node)
-> Hash (cost=34.00..34.00 rows=2400 width=4) (right node)
-> Seq Scan on t3 (cost=0.00..34.00 rows=2400 width=4) (left
node)

The schema is set up in such a way that the joins to t2 and t3 can be...
"skipped", and my new code in analyzejoin.c marks the RangeTblEntry records
for this relation to reflect that.

During ExecInitHashJoin for the join between t2 and t3, if I want to find
t2 in the tree, I'd need to
do hjstate->js.ps.lefttree->righttree->lefttree... (which I know just from
looking at the explain output) I just can't work out the logic behind where
the scan node will actually be. At first I had thought something like, loop
down the lefttree path until I reach a deadend, and that's the outer scan
node, but in this case there's a right turn in there too, so that won't
work. If I keep going down the left path I'd end up at t1, which is
completely wrong.

Can anyone shed any light on how I might determine where the scan rel is in
the tree? I need to find it so I can check if the RangeTblEntry is marked
as skip-able.

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2014-10-06 10:07:00 Improve automatic analyze messages for inheritance trees
Previous Message Emre Hasegeli 2014-10-06 09:36:09 Re: KNN-GiST with recheck