Skip site navigation (1) Skip section navigation (2)

Re: limit in subquery causes poor selectivity estimation

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-31 10:22:04
Message-ID: 1314786124.27073.11.camel@fsopti579.F-Secure.com (view raw or flat)
Thread:
Lists: pgsql-hackers
On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
> > EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2
> LIMIT 200);
> 
> > Here, however, it has apparently not passed this knowledge through
> the
> > LIMIT.
> 
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
> 
> > So what's up with that?  Just a case of, we haven't thought about
> > covering this case yet, or are there larger problems?
> 
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

It turns out that this is a regression introduced in 8.4.8; the same
topic is also being discussed in

http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

and

http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php

This is the (previously posted) plan with 8.4.8:

                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Hash Join  (cost=10.60..34.35 rows=500 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

And this is the plan with 8.4.7:

                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Hash Join  (cost=10.80..34.55 rows=200 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.30..8.30 rows=200 width=32)
         ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
               ->  Limit  (cost=0.00..3.80 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)

I liked the old one better. ;-)



In response to

Responses

pgsql-hackers by date

Next:From: Bernd HelmleDate: 2011-08-31 12:13:11
Subject: Informix FDW - anybody working on this?
Previous:From: Srinivas AjiDate: 2011-08-31 09:59:18
Subject: BUG #6189: libpq: sslmode=require verifies server certificate if root.crt is present

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group