Quick Links

limit in subquery causes poor selectivity estimation

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	limit in subquery causes poor selectivity estimation
Date:	2011-08-27 11:33:27
Message-ID:	1314444807.2349.10.camel@vanquo.pezone.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

This is an artificial test case shrunk down from a much larger real
query.

CREATE TABLE test1 (
sha1 bytea PRIMARY KEY,
something text
);

CREATE TABLE test2 (
sha1 bytea PRIMARY KEY,
blah text
);

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"; done
for i in $(seq 500 1500); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"; done

(Doesn't really matter whether the key values are the same or
overlapping or not. Just to make it interesting.)

ANALYZE;

EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2);
QUERY PLAN
----------------------------------------------------------------------
Hash Semi Join (cost=30.52..61.27 rows=1000 width=27)
Hash Cond: (test1.sha1 = test2.sha1)
-> Seq Scan on test1 (cost=0.00..17.00 rows=1000 width=27)
-> Hash (cost=18.01..18.01 rows=1001 width=21)
-> Seq Scan on test2 (cost=0.00..18.01 rows=1001 width=21)

That's OK. Apparently it can tell that joining two tables on their
primary keys cannot result in more rows than the smaller table. (Or
can it?)

EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
QUERY PLAN
----------------------------------------------------------------------------------
Hash Join (cost=10.60..33.35 rows=500 width=27)
Hash Cond: (test1.sha1 = test2.sha1)
-> Seq Scan on test1 (cost=0.00..17.00 rows=1000 width=27)
-> Hash (cost=8.10..8.10 rows=200 width=32)
-> HashAggregate (cost=6.10..8.10 rows=200 width=32)
-> Limit (cost=0.00..3.60 rows=200 width=21)
-> Seq Scan on test2 (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the
LIMIT.

The problem is that in the real query, instead of the 500 up there it
estimates about 30 million (which might be a reasonable estimate for a
join between two unkeyed columns), when it should in fact be 200.
(And again, this is part of a larger query, which is then completely
messed up because of this misestimation.)

So what's up with that? Just a case of, we haven't thought about
covering this case yet, or are there larger problems?

Responses

Re: limit in subquery causes poor selectivity estimation at 2011-08-27 17:32:54 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dean Rasheed	2011-08-27 11:43:57	Re: Inputting relative datetimes
Previous Message	Dean Rasheed	2011-08-27 11:29:41	Re: Inputting relative datetimes