Statistics use with functions

From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Statistics use with functions
Date: 2009-05-08 13:46:22
Message-ID: alpine.DEB.2.00.0905081428430.2341@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


I'm running a rather complex query and noticed a peculiarity in the usage
of statistics that seriously affects the plan generated. I can extract the
relevant bit:

modmine-r9=# select * from pg_stats where tablename = 'geneflankingregion' AND attname IN ('distance', 'direction');
schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+--------------------+-----------+-----------+-----------+------------+----------------------------------+------------------------------------------------+------------------+-------------
public | geneflankingregion | distance | 0 | 6 | 5 | {5.0kb,0.5kb,1.0kb,2.0kb,10.0kb} | {0.201051,0.200798,0.200479,0.199088,0.198583} | | 0.197736
public | geneflankingregion | direction | 0 | 10 | 2 | {downstream,upstream} | {0.500719,0.499281} | | 0.495437
(2 rows)

modmine-r9=# SELECT COUNT(*) FROM geneflankingregion;
count
--------
455020
(1 row)

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE distance = '10.0kb' AND direction = 'upstream';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on geneflankingregion (cost=0.00..15507.30 rows=45115 width=213) (actual time=0.053..181.764 rows=45502 loops=1)
Filter: ((distance = '10.0kb'::text) AND (direction = 'upstream'::text))
Total runtime: 227.245 ms
(3 rows)

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE LOWER(distance) = '10.0kb' AND LOWER(direction) = 'upstream';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on geneflankingregion
(cost=66.95..88.77 rows=11 width=213)
(actual time=207.555..357.359 rows=45502 loops=1)
Recheck Cond: ((lower(distance) = '10.0kb'::text) AND (lower(direction) = 'upstream'::text))
-> BitmapAnd
(cost=66.95..66.95 rows=11 width=0)
(actual time=205.978..205.978 rows=0 loops=1)
-> Bitmap Index Scan on geneflankingregion__distance_equals
(cost=0.00..31.34 rows=2275 width=0)
(actual time=79.380..79.380 rows=91004 loops=1)
Index Cond: (lower(distance) = '10.0kb'::text)
-> Bitmap Index Scan on geneflankingregion__direction_equals
(cost=0.00..35.35 rows=2275 width=0)
(actual time=124.639..124.639 rows=227510 loops=1)
Index Cond: (lower(direction) = 'upstream'::text)
Total runtime: 401.740 ms
(8 rows)

When I wrap the fields in the constraints in a LOWER() function, the
planner stops looking at the statistics and makes a wild guess, even
though it is very obvious from just looking what the result should be.
Embedded in a much larger query, the inaccuracy in the number of rows (11
instead of 45502) causes major planning problems. Also, why does the
BitmapAnd say zero actual rows?

I understand this probably isn't Priority No. 1, and there are some
interesting corner cases when n_distinct is higher than the histogram
width, but would it be possible to fix this one up?

Matthew

--
I would like to think that in this day and age people would know better than
to open executables in an e-mail. I'd also like to be able to flap my arms
and fly to the moon. -- Tim Mullen

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2009-05-08 15:48:05 Re: Statistics use with functions
Previous Message Viktor Rosenfeld 2009-05-08 10:17:53 Re: Indexes not used in DELETE