Reimplementing UNION/INTERSECT/EXCEPT

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Reimplementing UNION/INTERSECT/EXCEPT
Date: 2000-10-02 17:04:29
Message-ID: 4939.970506269@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been looking at UNION/INTERSECT/EXCEPT with an eye to making them
work in views and subselect-in-FROM (same thing really ;-)). I had first
thought that some marginal hacking on the parsetree representation might
be enough, but after study I am realizing just how broken this code really
is. It turns out that it's not even very close to implementing the SQL
spec. SQL92 7.10, general rule 1b says that if a row R has m duplicates
in T1 and n duplicates in T2 (m >= 0, n >= 0) then:

Set operation: contains this many duplicates of R:

T1 UNION T2 1 if m > 0 or n > 0, else 0

T1 INTERSECT T2 1 if m > 0 and n > 0, else 0

T1 EXCEPT T2 1 if m > 0 and n = 0, else 0

T1 UNION ALL T2 m + n

T1 INTERSECT ALL T2 min(m, n)

T1 EXCEPT ALL T2 max(m - n, 0)

We are OK for UNION (which we do as append, sort, unique) and for
UNION ALL (which we do as append). We are not OK for INTERSECT
and EXCEPT, which the code presently tries to do as

SELECT T1 INTERSECT SELECT T2 => SELECT T1 WHERE T1 IN (SELECT T2)

SELECT T1 EXCEPT SELECT T2 => SELECT T1 WHERE T1 NOT IN (SELECT T2)

This will give the wrong number of duplicates when m > 1. It could be
made to give the right answer by adding a DISTINCT to the select, but
there's still no expansion path for implementing INTERSECT ALL and EXCEPT
ALL.

There are a bunch of internal problems too (which is why views didn't
support UNION et al to begin with), mostly due to bad choices of data
structures. I have come to the conclusion that there's no point in
half measures: throwing out this code and rewriting from scratch will
take less time than trying to patch it.

Here's what I have in mind:

Parse-tree data structure (output of gram.y): remove
unionClause/intersectClause from SelectStmt. Add a SetOperation node type
that indicates the operation type (UNION/INTERSECT/EXCEPT plus an "all"
boolean) and links to two component SelectStmts or SetOperations. There
will be no loops in this data structure, unlike the present situation.
The top-level info (optional ORDER BY) added by the SelectStmt production
will be attached to the leftmost SelectStmt leaf in the tree.

Query-tree structure (output of parse_analyze): process the leaf
SelectStmts into Queries individually, after removing the OrderBy etc
info from the leftmost leaf. Create a top-level Query that contains the
leaf Queries as subselects-in-FROM. In place of
unionClause/intersectClause, Query nodes will have a Node *setOperations
field. In the top-level Query this will contain the SetOperation tree
emitted by gram.y, but with the leaf SelectStmt nodes replaced by
RangeTblRef references to the range table entries occupied by the leaf
Queries. (Thus, still no loops or multiple links.) The top-level Query
has a dummy targetlist that exists mainly to show the union'd datatype
of each output column, and it carries any sortClause, limitClause, etc
needed for the output of the entire operation. Note that we do not need
to transform the datatypes of the leaf queries' targetlists, which
eliminates a large class of bugs that exist presently with
cross-datatype UNIONs.

Rewriter: need pay no attention at all to setOperation tree.

Plan-tree structure (output of planner): UNION/UNION ALL are handled
same as now, except that what we are appending together is not directly
the top-level plan of each leaf query, but a SubqueryScan plan scanning
the output of each leaf query. This gives us one extra level of
projection (targetlist evaluation) in which to put conversions to the
common union datatype --- without breaking the semantics of GROUP BY,
DISTINCT, and so forth in the leaf queries. INTERSECT and EXCEPT will
be handled by building SubqueryScan plans that emit the common-datatype
columns plus a resjunk boolean column that shows whether the tuple is
coming from the left or right input. The outputs of these plans are
then appended together, sorted, and fed to a new plan node type that
implements INTERSECT(ALL) and EXCEPT(ALL). It will be a simple
generalization of the Unique plan type: scan a set of successive tuples
that agree in all the non-resjunk columns, counting the number of tuples
that came from left and right sides. This gives us 'm' and 'n' for each
set, from which the spec-defined behavior can be implemented immediately.

Executor: just need new plan-node-type implementation, which is easily
derived from nodeUnique.

I am not planning to try to implement SQL's "CORRESPONDING" option for
7.1. Whenever we do get to it, it should be a fairly straightforward
extension: add the corresponding-column lists to SetOperation nodes,
and then in the plan tree, make the SubqueryScan nodes emit just these
columns and not the entire targetlists of the leaf Queries. (This is
another reason why we need the extra level of targetlist...)

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vince Vielhaber 2000-10-02 17:07:46 Re: www.postgresql.org
Previous Message Karel Zak 2000-10-02 17:00:36 Re: www.postgresql.org