WIP patch for LATERAL subqueries

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WIP patch for LATERAL subqueries
Date: 2012-08-05 21:58:07
Message-ID: 29406.1344203887@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been idly amusing myself by trying to hack up support for
SQL-standard LATERAL subqueries. I've got something that turns over,
more or less:

regression=# select * from int4_tbl a, lateral (select unique1,unique2 from tenk1 b where a.f1 = unique1) x;
f1 | unique1 | unique2
----+---------+---------
0 | 0 | 9998
(1 row)

regression=# explain select * from int4_tbl a, lateral (select unique1,unique2 from tenk1 b where a.f1 = unique1) x;
QUERY PLAN
-----------------------------------------------------------------------------------
Nested Loop (cost=0.00..42.55 rows=5 width=12)
-> Seq Scan on int4_tbl a (cost=0.00..1.05 rows=5 width=4)
-> Index Scan using tenk1_unique1 on tenk1 b (cost=0.00..8.28 rows=1 width=8)
Index Cond: (a.f1 = unique1)
(4 rows)

but there's a good deal of work left to do, some of which could use some
discussion.

Feature/semantics issues:

Currently the patch only implements the syntax called out in the standard,
namely that you can put LATERAL in front of a <derived table>, which is
to say a parenthesized sub-SELECT in FROM. It strikes me that it might be
worth allowing LATERAL with a function-in-FROM as well. So basically
LATERAL func(args) <alias>
would be an allowed abbreviation for
LATERAL (SELECT * FROM func(args)) <alias>
Since the standard doesn't have function-in-FROM, it has nothing to say
about whether this is sane or not. The argument for this is mainly that
SRFs are one of the main use-cases for LATERAL (replacing SRF-in-the-
SELECT-list usages), so we might as well make it convenient. Any opinions
pro or con about that?

While fooling around in the planner I realized that I have no idea what
outer-level aggregates mean in a LATERAL subquery, and neither does
Postgres:
regression=# select 1 from tenk1 a, lateral (select * from int4_tbl b where f1 = max(a.unique1)) x;
ERROR: plan should not reference subplan's variable
I don't see anything prohibiting this in SQL:2008, but ordinarily this
would be taken to be an outer-level aggregate, and surely that is not
sensible in the LATERAL subquery. For the moment it seems like a good
idea to disallow it, though I am not sure where is a convenient place
to test for such things. Has anyone got a clue about whether this is
well-defined, or is it simply an oversight in the spec?

Parser issues:

I'm reasonably happy with the grammar patch, though tempted to refactor
it to reduce the amount of duplication (and would be more tempted if we
add LATERAL function calls). I'm thinking that an opt_alias production
could be used to eliminate the duplication, and am also strongly tempted
to move the error for no subselect alias out of the grammar and into
transformRangeSubselect.

Note that I made LATERAL be col_name_keyword. It can no longer be allowed
as a function name because this would be formally ambiguous:
LATERAL ((SELECT x FROM t)) t(x)
Is that a call on a function named LATERAL with a scalar-subquery
argument, or is it a LATERAL subquery with extra parentheses? However,
there seems no point in making it fully reserved. The <table_ref>
productions would still have to be repeated, because even with LATERAL
fully reserved, we can't combine them using an "opt_lateral" production.
On seeing "(" at the start of a FROM item, the parser doesn't know enough
to decide whether it should reduce opt_lateral to empty, which would be
the appropriate thing if the "(" starts a sub-select but not if it is,
say, a parenthesized JOIN tree. We could only avoid that by allowing
opt_lateral before every type of table_ref and then throwing explicit
errors for the disallowed cases, which doesn't end up making the grammar
simpler.

Although lateral cross-references work okay for the successive-FROM-items
case, they don't work at all yet for JOIN cases:

regression=# select * from int4_tbl a join lateral (select unique1,unique2 from tenk1 b where f1 = unique1) x on true;
ERROR: column "f1" does not exist
LINE 1: ...ateral (select unique1,unique2 from tenk1 b where f1 = uniqu...
^

regression=# select * from int4_tbl a join lateral (select unique1,unique2 from tenk1 b where a.f1 = unique1) x on true;
ERROR: invalid reference to FROM-clause entry for table "a"
LINE 1: ...ateral (select unique1,unique2 from tenk1 b where a.f1 = uni...
^
HINT: There is an entry for table "a", but it cannot be referenced from this part of the query.

The reason that the separate-FROM-items case works is that
transformFromClause pushes each FROM-clause item into p_relnamespace and
p_varnamespace immediately after parsing it, making those names visible
during parsing of subsequent FROM items. However, transformFromClauseItem
doesn't push the left-hand item into the lists before parsing the
right-hand item.

Now, the way this is being done currently is really pretty broken anyway.
As Andrew Gierth noted some time ago in
http://archives.postgresql.org/message-id/87ocpjscpa.fsf@news-spur.riddles.org.uk
it is incorrect to make these names visible to non-LATERAL subqueries,
because they may capture what should have been a valid reference to a
parent-level variable. Furthermore, it's pretty grotty to allow the
reference and then have to re-scan the subquery in transformRangeSubselect
to see if we allowed anything we should have disallowed.

What I'm thinking of, but have not yet tried to code, is that the
p_relnamespace and p_varnamespace lists should be divided into pairs
(so four lists altogether per ParseState). p_relnamespace/p_varnamespace
should always contain exactly those RTEs that are validly referenceable
by qualified or unqualified Vars (respectively) at the current point in
parsing. The new lists, say p_relnamespace_lateral/p_varnamespace_lateral,
contain RTEs that are validly referenceable inside a LATERAL subquery
occuring at the current point in this ParseState's query. We'd also want
a p_lateral_active boolean to show whether we're inside a LATERAL
subquery; that is what would tell variable lookup whether it should search
the p_xxx_lateral lists. With a data structure like this, I think we
can fix things so that only valid references are ever accepted and there
is no need for rechecking in transformRangeSubselect (or
transformJoinOnClause for that matter). The main reason for the current
arrangement is to be able to throw useful errors in case of an illegal
lateral reference; but I think we can still do that at the point of the
illegal reference, by groveling through the whole p_rtable list looking
to see if there would have been a match (which is more or less what
searchRangeTable already does for qualified references, so we'd just be
extending that approach to unqualified names).

One fine point here is that a LATERAL subquery in the RHS of a JOIN clause
is only allowed to reference the LHS of the JOIN when the join type is not
RIGHT or FULL. The way I am inclined to implement this is to not add the
LHS to the p_xxx_lateral lists when the join type is wrong, so that the
LHS is simply not in scope in the RHS. If you read SQL:2008 carefully,
their notion of how to handle this seems to be that the LHS *is* in scope
(since section 7.6 <table reference> syntax rule 6a doesn't say anything
about join types) but then you have to throw an error if the LHS is
actually referenced and the join type is wrong (section 7.7 <joined table>
syntax rule 2). To do it exactly like they say, we'd need to add some
kind of annotation to the p_xxx_lateral list items when they're on the
wrong side of a join, which would be a real PITA I think. In normal
cases, the simpler implementation would just lead to a different error
message. But it's conceivable that it would accept a query as valid (by
resolving an ambiguous reference as matching some outer query level) when
the spec would say it's invalid. I'm inclined to think that the spec is
simply poorly thought out here. I note that the prohibition against
RIGHT/FULL joins is not there in SQL:99, so they definitely weren't
thinking straight then, and the section 7.7 rule looks like a band-aid
over the SQL:99 mistake rather than a good fix. So I don't feel too bad
about deviating in this corner case, but I wonder if anyone else feels
differently, and if so whether they have an idea for a clean
implementation that matches the spec exactly.

Planner issues:

For the moment, I've hacked prepjointree.c's is_simple_subquery() to
prevent pull-up of LATERAL subqueries. This is just to restrict the scope
of the planner changes to SubqueryScan paths/plans. Relaxing the
restriction will require making sure that all other path/plan types can be
parameterized, which is something I figure can be left for later. This
does mean there are cases that won't be optimized as nicely as one could
wish, but for a work-in-progress implementation that doesn't bother me.

The thing that is most worrisome in this area is that heretofore, the
planner has assumed that every relation has at least one unparameterized
path; but for a LATERAL subquery, or anything we might pull up out of it,
there are no such paths. The main implication of this in the code is that
a RelOptInfo's cheapest_startup_path/cheapest_total_path might not exist,
at least not with the current definition that they're the cheapest
unparameterized paths. For the moment I've dealt with this by lobotomizing
a lot of places where these paths were assumed to not be NULL. I think
however that that's unduly constraining join planning: basically, we won't
ever consider a merge or hash join involving a still-parameterized lateral
subquery, and that's probably not good. I'm considering altering the
definitions of these fields to be "the cheapest minimally-parameterized
paths", but haven't quite decided if that's a good idea or not. There
are various areas such as GEQO that will crash on lateral subqueries
pending a resolution of this, because there wasn't any easy way to just
make them punt.

Also, there are at least three places --- extract_lateral_references,
set_subquery_pathlist, and identify_nestloop_extparams --- that are
independently re-deriving information about the sets of lateral references
in a subquery. This seems like a bit of a crock; I'd be happier if we
could do the work just once in some fashion. I note that SubLink
processing has a very similar problem of needing to pull out all the
upper references to form "args" lists, too. Not sure how to refactor
that, but maybe we should expect the parser to provide annotation about
outer refs instead of making the planner re-derive it?

Executor issues:

AFAICT, there aren't any. Sweet. The parameterization work I did two
years ago held up.

Comments, better ideas?

regards, tom lane

Attachment Content-Type Size
lateral-1.patch text/x-patch 44.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2012-08-05 22:58:42 Re: WIP patch for LATERAL subqueries
Previous Message Mary F. Masterson 2012-08-04 20:59:34 Pgadmin3 v1.14.2 foreign keys