Re: Bad estimate with partial index

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: André Hänsel <andre(at)webkr(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bad estimate with partial index
Date: 2022-04-19 18:01:46
Message-ID: 2568735.1650391306@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

=?iso-8859-1?Q?Andr=E9_H=E4nsel?= <andre(at)webkr(dot)de> writes:
> I have a case where Postgres chooses the wrong index and I'm not sure what
> to do about it:

The core problem here seems to be a poor estimate for the selectivity
of "WHERE cropped AND NOT resized":

regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized ;
...
-> Bitmap Heap Scan on t (cost=35.26..6352.26 rows=91100 width=0) (actual time=0.121..0.190 rows=1000 loops=1)
Recheck Cond: (cropped AND (NOT resized))
...

I think this is because the planner expects those two columns to be
independent, which they are completely not in your test data. Perhaps
that assumption is more true in your real-world data, but since you're
here complaining, I suppose not :-(. What you can do about that, in
recent Postgres versions, is to create extended statistics on the
combination of the columns:

regression=# create statistics t_stats on cropped, resized from t;
CREATE STATISTICS
regression=# analyze t;
ANALYZE
regression=# EXPLAIN ANALYZE
SELECT count(*) FROM t
WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3145.15..3145.16 rows=1 width=8) (actual time=9.765..9.766 rows=1 loops=1)
-> Index Scan using idx_resized on t (cost=0.29..3142.65 rows=1000 width=0) (actual time=9.608..9.735 rows=1000 loops=1)
Filter: (cropped AND (create_date < CURRENT_DATE))
Rows Removed by Filter: 100000
Planning Time: 0.115 ms
Execution Time: 9.779 ms

Better estimate, but it's still using the wrong index :-(. If we force
use of the other one:

regression=# drop index idx_resized;
DROP INDEX
regression=# EXPLAIN ANALYZE
regression-# SELECT count(*) FROM t
regression-# WHERE cropped AND NOT resized AND create_date < CURRENT_DATE;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6795.38..6795.39 rows=1 width=8) (actual time=0.189..0.191 rows=1 loops=1)
-> Bitmap Heap Scan on t (cost=13.40..6792.88 rows=1000 width=0) (actual time=0.047..0.147 rows=1000 loops=1)
Recheck Cond: (cropped AND (NOT resized))
Filter: (create_date < CURRENT_DATE)
Heap Blocks: exact=6
-> Bitmap Index Scan on specific (cost=0.00..13.15 rows=91565 width=0) (actual time=0.035..0.035 rows=1000 loops=1)
^^^^^^^^^^
Planning Time: 0.154 ms
Execution Time: 0.241 ms

it looks like the problem is that the extended stats haven't been used
while forming the estimate of the number of index entries retrieved,
so we overestimate the cost of using this index.

That seems like a bug. Tomas?

In the meantime, maybe you could dodge the problem by combining
"cropped" and "resized" into one multivalued column, so that there's
not a need to depend on extended stats to arrive at a decent estimate.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mahesh Gouru 2022-04-19 18:01:54 DBT-5 Stored Procedure Development (2022)
Previous Message Andres Freund 2022-04-19 17:55:26 Re: TRAP: FailedAssertion("tabstat->trans == trans", File: "pgstat_relation.c", Line: 508