distinct estimate of a hard-coded VALUES list

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: distinct estimate of a hard-coded VALUES list
Date: 2016-08-18 21:03:24
Message-ID: CAMkU=1xHkyPa8VQgGcCNg3RMFFvVxUdOpus1gKcFuvVi0w6Acg@mail.gmail.com
Lists: pgsql-hackers

I have a query which contains a where clause like:

aid = ANY(VALUES (1),(45),(87), <6948 more>, (4999947))

for example:

perl -le 'print "explain (analyze) select sum(abalance) from pgbench_accounts where aid=ANY(VALUES ";
  print join ",", map "($_)", sort {$a<=>$b} map int(rand(5000000)), 1..6952;
  print ");"' | psql
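(For scale: int(rand(5000000)) matches a scale-50 pgbench database, where pgbench_accounts has 5,000,000 rows. The plan below reports 6931 rows coming out of the aggregate rather than 6952, presumably because the random draw produces a few duplicates.)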

And it gives me an explain section like:

->  HashAggregate  (cost=104.28..106.28 rows=200 width=32) (actual time=15.171..30.859 rows=6931 loops=1)
      Group Key: "*VALUES*".column1
      ->  Values Scan on "*VALUES*"  (cost=0.00..86.90 rows=6952 width=32) (actual time=0.007..6.926 rows=6952 loops=1)

So even though it knows that 6952 values were shoved in at the bottom, it
thinks only 200 will come out of the aggregation. That seems like a really
lousy estimate, and in queries more complex than the example given here it
leads to poor planning choices.

Is the size of the input list not available to the planner at the point
where it estimates the distinct size of that list? I'm assuming that if it
is available to EXPLAIN, then it is available to the planner. Or does it
know how large the input list is, but just throw up its hands and use 200
as the distinct size anyway?
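
From a (possibly naive) read of the source, I suspect the latter: a VALUES
RTE carries no statistics, so get_variable_numdistinct() falls back to the
generic DEFAULT_NUM_DISTINCT, which is 200, and the known row count only
enters the picture as a clamp on that default. A quick sketch of the clamp
(any database will do):

    -- With a short list the 200 default gets clamped to the input
    -- row count, so the estimate looks fine:
    explain select distinct column1 from (values (1),(2),(3)) v;
    -- I'd expect rows=3 here, but rows=200 once the list has more
    -- than 200 entries, no matter how many of them are distinct.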

If I throw a massively degenerate list at the system and it makes bad
choices because it assumes the list is distinct, I have the recourse of
uniquifying the list myself before passing it in. If I pass in a mostly
distinct list and it makes bad choices because it assumes only 200 values
are distinct, I don't have much recourse, aside from creating, populating,
analyzing, and then using temporary tables (sketched below). That is
annoying, and causes other problems like catalog bloat.
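
For concreteness, the temp-table workaround looks something like this (the
table name aid_list is just for illustration):

    create temp table aid_list (aid int);
    insert into aid_list values (1),(45),(87);  -- ... the full list
    analyze aid_list;

    select sum(abalance)
    from pgbench_accounts
    where aid in (select aid from aid_list);

With real statistics on aid_list the planner gets the distinct estimate
right, but every round of this leaves dead catalog entries behind.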

Is this something in the planner that could be fixed in a future version,
or is a temp table the only real solution here?

Cheers,

Jeff
