Re: Reference to parent query from ANY sublink

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, Antonin Houska <antonin(dot)houska(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reference to parent query from ANY sublink
Date: 2013-12-11 20:13:38
Message-ID: 1386792818.58854.YahooMailNeo@web162902.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> I applied it to master and ran the regression tests, and one of
> the subselect tests failed.
>
> This query:
>
> SELECT '' AS six, f1 AS "Correlated Field", f3 AS "Second
> Field"
>   FROM SUBSELECT_TBL upper
>   WHERE f1 IN
>     (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3);

> [ ... ] during the `make check` or `make install-check` [ ... ]
> is missing the last two rows.  Oddly, if I go into the database
> later and try it, the rows show up.  It's not immediately
> apparent to me what's wrong.

Using the v2 patch, with the default statistics from table
creation, the query modified with an alias of "lower" for the
second reference, just for clarity, yields a plan which generates
incorrect results:

 Hash Join  (cost=37.12..80.40 rows=442 width=12) (actual time=0.059..0.064 rows=3 loops=1)
   Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2))
   ->  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16) (actual time=0.006..0.007 rows=8 loops=1)
   ->  Hash  (cost=34.12..34.12 rows=200 width=12) (actual time=0.020..0.020 rows=5 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  HashAggregate  (cost=32.12..34.12 rows=200 width=12) (actual time=0.014..0.018 rows=6 loops=1)
               ->  Seq Scan on subselect_tbl lower  (cost=0.00..27.70 rows=1770 width=12) (actual time=0.002..0.004 rows=8 loops=1)
 Total runtime: 0.111 ms

As soon as there is a VACUUM and/or ANALYZE it generates a plan
more like what the OP was hoping for:

 Hash Semi Join  (cost=1.20..2.43 rows=6 width=12) (actual time=0.031..0.036 rows=5 loops=1)
   Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2))
   ->  Seq Scan on subselect_tbl upper  (cost=0.00..1.08 rows=8 width=16) (actual time=0.004..0.007 rows=8 loops=1)
   ->  Hash  (cost=1.08..1.08 rows=8 width=12) (actual time=0.012..0.012 rows=7 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Seq Scan on subselect_tbl lower  (cost=0.00..1.08 rows=8 width=12) (actual time=0.003..0.005 rows=8 loops=1)
 Total runtime: 0.074 ms

By comparison, without the patch this is the plan:

 Seq Scan on subselect_tbl upper  (cost=0.00..5.59 rows=4 width=12) (actual time=0.022..0.037 rows=5 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 3
   SubPlan 1
     ->  Seq Scan on subselect_tbl lower  (cost=0.00..1.12 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=8)
           Filter: ((upper.f2)::double precision = f3)
           Rows Removed by Filter: 4
 Total runtime: 0.066 ms

When I run the query with fresh statistics and without EXPLAIN both
ways, the unpatched is consistently about 10% faster.

So pulling up the subquery can yield an incorrect plan, and even
when it yields the "desired" plan with the semi-join it is
marginally slower than using the subplan, at least for this small
data set.  I think it's an interesting idea, but it still needs
work.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-12-11 20:31:24 Re: Extension Templates S03E11
Previous Message Gavin Flower 2013-12-11 20:12:20 Re: ANALYZE sampling is too good