Re: Proposal for supporting outer joins in 7.1

From: Thomas Lockhart <lockhart(at)alumni(dot)caltech(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Proposal for supporting outer joins in 7.1
Date: 2000-08-26 07:21:35
Message-ID: 39A76FFF.71A24638@alumni.caltech.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I would like to see if we can't get some amount of OUTER JOIN functionality
> into the 7.1 release.

Great!

> It seems that the executor implementation should be
> pretty simple, at least for the left-join case (which is what people appear
> to need the most). What we are mostly missing is a suitable parsetree
> representation plus appropriate handling in the planner.

I've already got the merge-join code walking left, right, and full joins
(look for #ifdef code in the appropriate routines). Just didn't implment
the "null column filling", and didn't figure out how to push the
left/right/full flags into the executor. I haven't looked at the other
join techniques to see how easy those would be.

> Right now, the rangetable associated with a query serves two purposes.
> First, it provides a list of all relations used in the query for Var nodes
> to refer to (a Var's "varno" field is just an index into the rangetable
> list). Second, it drives the planner's construction of the join tree for
> the query (all entries that are marked inJoinSet will be added to the join
> tree in some order). I think what we want to do is separate these two
> functions into two data structures. The rangetable list is just fine for
> looking up Var references, but it's not sufficient information for
> representing multiple types of joins. A ready-to-plan Query node should
> have an additional subsidiary data structure that describes the required
> join tree of relations.
>
> The additional data structure could really be pretty close to the raw parse
> structure that the grammar emits for FROM clauses. I envision it as a tree
> whose leaf nodes are references to rangetable entries and whose upper-level
> nodes correspond to JOIN clauses (each JOIN node has pointers to its child
> nodes plus information about the specified join type and any supplied
> constraint conditions). If the FROM clause has more than one item, the
> top level of the join tree is a CROSS JOIN node with all the FROM-clause
> items as children. (Note that at least for this case, a join node needs
> to be able to have more than two children.)
>
> If anything, this structure should be easier for the parser/analyzer to emit
> than what it's doing now. Notice in particular that ON/USING qual clauses
> associated with joins are to be left in the join tree, *not* merged into the
> main WHERE clause. (The planner would just have to separate them out again
> and discover which join they should go with, so what's the point of merging
> them into WHERE?)
>
> I envision the planner as operating by walking this join tree and
> considering plans that perform each specified join. At nodes with more than
> two children (such as the top-level CROSS JOIN from a many-element FROM
> clause) the planner will consider all possible orders for joining the
> children together, the same as it does now. Note that this implementation
> implies that the user can guide/constrain the planner by writing an
> appropriate FROM clause. For example:
>
> SELECT ... FROM a,b,c
> Planner will consider all three possible join orders:
> (a cross join b) cross join c
> (a cross join c) cross join b
> (b cross join c) cross join a
>
> SELECT ... FROM (a cross join b) cross join c
> Planner will only consider joining a with b and then adding c.
>
> This might be construed as either a feature or a bug depending on your
> views about how much responsibility to put on the planner vs. the user.
> In any case it seems like a reasonable first-cut implementation; we can
> think about improvements later.
>
> Thoughts, objections, better ideas?
>
> One thing I'm not very clear on is how much has been done in the parser
> about the ISO rules for visibility and aliasing of table columns in
> explicit JOINs. Thomas, can you sketch out where we stand in that area?

The current code does not do the scoping rules quite right, but it isn't
very far wrong either. I think that the scoping and aliasing can be
contained within the parser code, swallowing the intermediate scopes by
resolving back to the original names. However, if we continue to do it
this way (losing intermediate alias names) then we will be making it
harder for the planner/optimizer to jiggle up the plans, since the query
tree will have already assumed some references. For example:

select * from (a join b on (i)) join c using (i = j);

is equivalent to

select * from (a join b on (i)) as t1 (i, ...) join c using (t1.i =
j);

but the parser may already recast it as

select a.i, a..., b.x, b..., c.j, c... from a, b, c
where (a.i = b.i) and (a.i = c.j);

so the optimizer would have to figure out for itself that a.i is
interchangable with b.i.

(Also, you'll remember that it currently barfs on three-way joins; I
haven't had a chance to look at that yet).

Getting outer joins for 7.1 would be great. Count me in to help...

- Thomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2000-08-26 07:26:26 Re: Performance on inserts
Previous Message Thomas Lockhart 2000-08-26 07:04:44 Re: Access PostgreSQL server via SSL/Internet