limit in subquery causes poor selectivity estimation

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: limit in subquery causes poor selectivity estimation
Date: 2011-08-27 11:33:27
Message-ID: 1314444807.2349.10.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This is an artificial test case shrunk down from a much larger real
query.

CREATE TABLE test1 (
sha1 bytea PRIMARY KEY,
something text
);

CREATE TABLE test2 (
sha1 bytea PRIMARY KEY,
blah text
);

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"; done
for i in $(seq 500 1500); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"; done

(Doesn't really matter whether the key values are the same or
overlapping or not. Just to make it interesting.)

ANALYZE;

EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2);
QUERY PLAN
----------------------------------------------------------------------
Hash Semi Join (cost=30.52..61.27 rows=1000 width=27)
Hash Cond: (test1.sha1 = test2.sha1)
-> Seq Scan on test1 (cost=0.00..17.00 rows=1000 width=27)
-> Hash (cost=18.01..18.01 rows=1001 width=21)
-> Seq Scan on test2 (cost=0.00..18.01 rows=1001 width=21)

That's OK. Apparently it can tell that joining two tables on their
primary keys cannot result in more rows than the smaller table. (Or
can it?)

EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
QUERY PLAN
----------------------------------------------------------------------------------
Hash Join (cost=10.60..33.35 rows=500 width=27)
Hash Cond: (test1.sha1 = test2.sha1)
-> Seq Scan on test1 (cost=0.00..17.00 rows=1000 width=27)
-> Hash (cost=8.10..8.10 rows=200 width=32)
-> HashAggregate (cost=6.10..8.10 rows=200 width=32)
-> Limit (cost=0.00..3.60 rows=200 width=21)
-> Seq Scan on test2 (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the
LIMIT.

The problem is that in the real query, instead of the 500 up there it
estimates about 30 million (which might be a reasonable estimate for a
join between two unkeyed columns), when it should in fact be 200.
(And again, this is part of a larger query, which is then completely
messed up because of this misestimation.)

So what's up with that? Just a case of, we haven't thought about
covering this case yet, or are there larger problems?

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2011-08-27 11:43:57 Re: Inputting relative datetimes
Previous Message Dean Rasheed 2011-08-27 11:29:41 Re: Inputting relative datetimes