Re: Patch to support SEMI and ANTI join removal

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: Patch to support SEMI and ANTI join removal
Date: 2014-10-06 14:46:09
Message-ID: CA+TgmoZ3F1U2jeOMEeKArKtNy_umbCC0Vosg+Y5st4r71Own+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 6, 2014 at 5:57 AM, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
>> Hm, right. But that doesn't seem like a fatal problem to me. The planner
>> knows about t1/t2 and Seq(t1), Seq(t2), not just Hash(Seq(t2)). So it
>> can tell the HashJoin node that when the 'shortcut' qualifier is true,
>> it should source everything from Seq(t2). Since the sequence scan
>> doesn't care about the node ontop that doesn't seem to be overly
>> dramatic?
>> Obviously reality makes this a bit more complicated...
>>
> Ok, after a bit of study on the hash join code I can see what you mean, it
> shouldn't really matter.
> I've started working on this now and I've made some changes in
> analyzejoins.c so that instead of removing joins, it just marks the
> RangeTblEntry, setting a new flag named skipJoinPossible to true. (I'll
> think of a better name later)
>
> I'm starting off with hash joins and I'm hacking a bit at ExecInitHashJoin.
> I want to add 2 bool flags to HashJoinState, outerIsRequired and
> innerIsRequired. I think if both of these flags are set then we can just
> abort the join altogether. Though in order to set these flags I need to
> identify which relations are for the outer and which are for the inner side
> of the join. I've got the logic for that only partially worked out. My
> understanding of it so far is that I just need to look at the
> hjstate->js.ps.lefttree and righttree. Inner being right, and outer being
> left. I'm a little stuck on more complex cases where the scan is nested
> deeper in the tree and I'm not quite sure on the logic I should be using to
> navigate to it.
>
> Take the following plan: (which I've amended to mark the left and right
> nodes)
>
> explain select t1.* from t1 inner join t2 on t1.t2_id= t2.id inner join t3
> on t2.t3_id=t3.id;
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=122.15..212.40 rows=2140 width=8)
> Hash Cond: (t2.t3_id = t3.id)
> -> Hash Join (cost=58.15..118.98 rows=2140 width=12) (left node)
> Hash Cond: (t1.t2_id = t2.id)
> -> Seq Scan on t1 (cost=0.00..31.40 rows=2140 width=8) (left
> node)
> -> Hash (cost=31.40..31.40 rows=2140 width=8) (right node)
> -> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=8)
> (left node)
> -> Hash (cost=34.00..34.00 rows=2400 width=4) (right node)
> -> Seq Scan on t3 (cost=0.00..34.00 rows=2400 width=4) (left
> node)
>
> The schema is set up in such a way that the joins to t2 and t3 can be...
> "skipped", and my new code in analyzejoin.c marks the RangeTblEntry records
> for this relation to reflect that.
>
> During ExecInitHashJoin for the join between t2 and t3, if I want to find t2
> in the tree, I'd need to do hjstate->js.ps.lefttree->righttree->lefttree...
> (which I know just from looking at the explain output) I just can't work out
> the logic behind where the scan node will actually be. At first I had
> thought something like, loop down the lefttree path until I reach a deadend,
> and that's the outer scan node, but in this case there's a right turn in
> there too, so that won't work. If I keep going down the left path I'd end up
> at t1, which is completely wrong.
>
> Can anyone shed any light on how I might determine where the scan rel is in
> the tree? I need to find it so I can check if the RangeTblEntry is marked as
> skip-able.

I think you're probably going to need logic that knows about
particular node types and does the right thing for each one. I think
- but maybe I'm misunderstanding - what you're looking for is a
function of the form Oid ScansOnePlainTableWithoutQuals(). The
algorithm could be something like:

switch (node type)
{
case T_SeqScanState:
if (no quals)
return the_appropriate_oid;
return false;
case T_HashJoin:
decide whether we can ignore one side of the join
fish out the node from the other side of the join (including
reaching through the Hash node if necessary)
return ScansOnePlainTableWithoutQuals(the node we fished out);
...other specific cases...
default:
return false;
}

This seems messy, though. Can't the deferred trigger queue become
non-empty at pretty much any point in time? At exactly what point are
we making this decision, and how do we know the correct answer can't
change after that point?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-10-06 14:51:28 Re: [RFC] Incremental backup v2: add backup profile to base backup
Previous Message Marti Raudsepp 2014-10-06 14:26:38 Re: Add generate_series(numeric, numeric)