Re: Confirmation of bad query plan generated by 7.4

From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Shaun Thomas <sthomas(at)leapfrogonline(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Confirmation of bad query plan generated by 7.4
Date: 2006-06-13 22:39:51
Message-ID: 20060613223950.GK34196@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Jun 13, 2006 at 06:04:42PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> > Also, I'm not sure that the behavior is entirely changed, either. On a
> > 8.1.4 database I'm still seeing a difference between now() - interval
> > and a hard-coded date.
>
> It'd depend on the context, possibly, but it's easy to show that the
> current planner does fold "now() - interval_constant" when making
> estimates. Simple example:

Turns out the difference is between feeding a date vs a timestamp into the
query... I would have thought that since date is a date that the WHERE clause
would be casted to a date if it was a timestamptz, but I guess not...

stats=# explain select * from email_contrib where project_id=8 and date >= now()-'15 days'::interval;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib (cost=0.01..45405.83 rows=14225 width=24)
Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval)))
(2 rows)

stats=# explain select * from email_contrib where project_id=8 AND date >= '2006-05-29 22:09:56.814897+00'::date;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib (cost=0.00..48951.74 rows=15336 width=24)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
(2 rows)

stats=# explain select * from email_contrib where project_id=8 AND date >= '2006-05-29 22:09:56.814897+00'::timestamp;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib (cost=0.00..45472.76 rows=14246 width=24)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29 22:09:56.814897'::timestamp without time zone))
(2 rows)

Actual row count is 109071; reason for the vast difference is querying on two columns.

I know comming up with general-purpose multicolumn stats is extremely
difficult, but can't we at least add histograms for multi-column indexes?? In
this case that would most likely make the estimate dead-on, because there's an
index on project_id, date.

Details below for the morbidly curious/bored...

stats=# \d email_contrib
Table "public.email_contrib"
Column | Type | Modifiers
------------+---------+-----------
project_id | integer | not null
id | integer | not null
date | date | not null
team_id | integer |
work_units | bigint | not null
Indexes:
"email_contrib_pkey" PRIMARY KEY, btree (project_id, id, date), tablespace "raid10"
"email_contrib__pk24" btree (id, date) WHERE project_id = 24, tablespace "raid10"
"email_contrib__pk25" btree (id, date) WHERE project_id = 25, tablespace "raid10"
"email_contrib__pk8" btree (id, date) WHERE project_id = 8, tablespace "raid10"
"email_contrib__project_date" btree (project_id, date), tablespace "raid10"
"email_contrib__project_id" btree (project_id), tablespace "raid10"
"email_contrib__team_id" btree (team_id), tablespace "raid10"
Foreign-key constraints:
"fk_email_contrib__id" FOREIGN KEY (id) REFERENCES stats_participant(id) ON UPDATE CASCADE
"fk_email_contrib__team_id" FOREIGN KEY (team_id) REFERENCES stats_team(team) ON UPDATE CASCADE
Tablespace: "raid10"

stats=# explain analyze select * from email_contrib where project_id=8 and date >= now()-'15 days'::interval;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib (cost=0.01..45475.95 rows=14247 width=24) (actual time=0.294..264.345 rows=109071 loops=1)
Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval)))
Total runtime: 412.167 ms
(3 rows)

stats=# select now()-'15 days'::interval;
?column?
-------------------------------
2006-05-29 22:09:56.814897+00
(1 row)

stats=# explain analyze select * from email_contrib where project_id=8 and date >= '2006-05-29 22:09:56.814897+00';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib (cost=0.00..48951.74 rows=15336 width=24) (actual time=0.124..229.800 rows=116828 loops=1)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
Total runtime: 391.240 ms
(3 rows)

stats=# explain select * from email_contrib where project_id=8 and date >= '2006-05-29 22:09:56.814897+00'::date;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib (cost=0.00..48951.74 rows=15336 width=24)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
(2 rows)

So casting to date doesn't change anything, but dropping project_id from the
where clause certainly does...

stats=# explain analyze select * from email_contrib where date >= now()-'15 days'::interval;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_contrib (cost=847355.98..1256538.96 rows=152552 width=24) (actual time=74886.028..75267.633 rows=148894 loops=1)
Recheck Cond: (date >= (now() - '15 days'::interval))
-> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98 rows=152552 width=0) (actual time=74885.690..74885.690 rows=148894 loops=1)
Index Cond: (date >= (now() - '15 days'::interval))
Total runtime: 75472.490 ms
(5 rows)

That estimate is dead-on. So it appears it's yet another case of cross-column
stats. :( But there's still a difference between now()-interval and something hard-coded:

stats=# explain analyze select * from email_contrib where date >= '2006-05-29 22:09:56.814897+00'::date;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_contrib (cost=847355.98..1278756.22 rows=164256 width=24) (actual time=19356.752..19623.450 rows=159348 loops=1)
Recheck Cond: (date >= '2006-05-29'::date)
-> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98 rows=164256 width=0) (actual time=19356.391..19356.391 rows=159348 loops=1)
Index Cond: (date >= '2006-05-29'::date)
Total runtime: 19841.614 ms
(5 rows)

stats=# explain analyze select * from email_contrib where date >= (now()-'15 days'::interval)::date;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_contrib (cost=847355.98..1279988.15 rows=164256 width=24) (actual time=19099.417..19372.167 rows=159348 loops=1)
Recheck Cond: (date >= ((now() - '15 days'::interval))::date)
-> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98 rows=164256 width=0) (actual time=19099.057..19099.057 rows=159348 loops=1)
Index Cond: (date >= ((now() - '15 days'::interval))::date)
Total runtime: 19589.785 ms

Aha! It's the casting to date that changes things.

The stats target is 100...

stats=# select attname, n_distinct from pg_stats where tablename='email_contrib';
attname | n_distinct
------------+------------
project_id | 6
team_id | 4104
work_units | 6795
date | 3034
id | 35301
(5 rows)

The n_distinct for project_id and date both look about right.

stats=# select * from pg_stats where tablename='email_contrib' and attname='project_id';
-[ RECORD 1 ]-----+------------------------------------------------------------
schemaname | public
tablename | email_contrib
attname | project_id
null_frac | 0
avg_width | 4
n_distinct | 6
most_common_vals | {205,5,8,25,24,3}
most_common_freqs | {0.4273,0.419833,0.0933667,0.0514667,0.00506667,0.00296667}
histogram_bounds |
correlation | 0.605662

stats=# select relpages,reltuples from pg_class where relname='email_contrib';
relpages | reltuples
----------+-------------
996524 | 1.35509e+08

If we look at how many rows would match project_id 8 and any 15 dates...

stats=# SELECT 1.35509e+08 * 0.0933667 / 3034 * 15;
?column?
------------------------
62551.2268472313777195

We come up with something much closer to reality (116828 rows). I guess the
problem is in the histogram for date; where the last 3 values are:

2005-11-02,2006-03-05,2006-06-11
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Shaun Thomas 2006-06-13 22:41:06 Re: Confirmation of bad query plan generated by 7.4
Previous Message Michael Fuhr 2006-06-13 22:34:01 Re: Solaris shared_buffers anomaly?