EquivalenceClasses and subqueries and PlaceHolderVars, oh my

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: EquivalenceClasses and subqueries and PlaceHolderVars, oh my
Date: 2012-03-15 01:29:49
Message-ID: 10461.1331774989@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I looked into the problem complained of here:
http://archives.postgresql.org/pgsql-bugs/2012-03/msg00016.php
It's not at all specific to custom types; you can exhibit it with
this query in the regression database:

explain select * from
(select 1 as t, unique1 from tenk1 a
union all
select 2 as t, unique1 from tenk1 b) c
where t = 2;

9.0 successfully optimizes away the excluded subquery:

QUERY PLAN
-------------------------------------------------------------------------
Result (cost=0.00..458.00 rows=10000 width=8)
-> Append (cost=0.00..458.00 rows=10000 width=8)
-> Seq Scan on tenk1 b (cost=0.00..458.00 rows=10000 width=8)
(3 rows)

but 9.1 and HEAD not so much:

QUERY PLAN
----------------------------------------------------------------------
Result (cost=0.00..966.00 rows=100 width=8)
-> Append (cost=0.00..966.00 rows=100 width=8)
-> Seq Scan on tenk1 a (cost=0.00..483.00 rows=50 width=8)
Filter: (1 = 2)
-> Seq Scan on tenk1 b (cost=0.00..483.00 rows=50 width=8)
Filter: (2 = 2)
(6 rows)

This is a consequence of commit 57664ed25e5dea117158a2e663c29e60b3546e1c,
which was already known to cause some issues as per commit
b28ffd0fcc583c1811e5295279e7d4366c3cae6c. This gripe is basically the
same problem: when we push the "t = 2" condition down into the subqueries,
"t" is no longer replaced with just constant 1 or constant 2, but with
those constants wrapped in PlaceHolderVar nodes. In this case that
prevents constant-folding from realizing it can simplify "1 = 2" or
"2 = 2" to constant false or true, whereas in the previous complaint
the PHV wrappers were interfering with matching expressions to indexes.

I spent a fair amount of time thinking about whether we could revert
that patch and solve the original problem some other way, but with no
real success. The original problem was reported here:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg00419.php
with an example equivalent to this variant of the previous query:

explain select * from
(select thousand as t1, tenthous as t2 from tenk1 a
union all
select 42 as t1, 42 as t2 from tenk1 b) c
order by t1, t2;

There is an EquivalenceClass for each of "t1" and "t2", and if we don't
do something like wrapping the constants with distinct PHVs, then
add_child_rel_equivalences will end up pushing identical constants into
both ECs, thus totally bollixing the fundamental rule that any expression
should match at most one EC.

Another variant of this is where there are identical Vars rather than
constants in one of the subqueries:

explain select * from
(select thousand as t1, tenthous as t2 from tenk1 a
union all
select unique2 as t1, unique2 as t2 from tenk1 b) c
order by t1, t2;

I chose this example to match existing indexes in the regression database:
the ideal plan would do an indexscan on the (thousand, tenthous) index
for the first arm, and an indexscan on the (unique2) index for the second
arm, and MergeAppend them together. In general the planner is aware that
"ORDER BY x, x" is the same as "ORDER BY x", so you'd think it could apply
that principle to the second arm of this union ... but it can't. To do
that, it would have to realize that the unique2 index matches both of the
EquivalenceClasses in this query, and that's totally outside its model of
reality.

It seems to me that to do a really nice job with this sort of situation,
we would need some more general concept than EquivalenceClasses. I'm
not sure what, though I have a vague feeling that it might look like
EquivalenceClasses that are only valid within some sub-area of a query.

Now, this is a sufficiently weird corner case that I'm not desirous of
making major planner design changes just to improve this particular
outcome (and in any case that doesn't sound like a backpatchable bug
fix). But down the road we may think of more reasons why we need a
better idea than EquivalenceClasses.

In the meantime, the best solution I've been able to think of goes like
this: continue to add PHVs on to duplicated or non-Var subquery outputs
when propagating those outputs into the outer query, but then strip them
off again when propagating transformed outer expressions down into the
sub-query. There are basically only two places where we do the latter ---
set_append_rel_pathlist in allpaths.c propagates the inheritance parent's
baserestrictlist and other attachments to child rels, and
match_eclass_clauses_to_index extracts modified join clauses from
EquivalenceClasses. So it's a bit ugly but should be a localized fix,
and it would allow us to revert b28ffd0fcc583c1811e5295279e7d4366c3cae6c
because the problem would be taken care of at a higher level. This
would not fix the problem shown in the last example, that ideally we
should be able to match an index to more than one EquivalenceClass;
the planner would just take the first match. In just about all of the
non-contrived cases I've been able to think of, this is all right because
there's only one plausible match anyhow, so I'm not terribly concerned.

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-03-15 01:30:27 Re: Client Messages
Previous Message Robert Haas 2012-03-15 01:24:26 Re: Faster compression, again