Re: Random Page Cost and Planner

From: David Jarvis <thangalin(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Random Page Cost and Planner
Date: 2010-05-26 03:50:09
Message-ID: AANLkTikpsmzgv6R7yv7DPTlTEmFhmkJhTwOdB3W5tPh1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi, Tom.

Yes, that is what happened, making the tests rather meaningless, and giving
me the false impression that the indexes were being used. They were but only
because of cached results. When multiple users making different queries, the
performance will return to ~80s per query.

I also tried Kevin's suggestion, which had no noticeable effect:
effective_cache_size = 512MB

That said, when using the following condition, the query is fast (1 second):

extract(YEAR FROM sc.taken_start) >= 1963 AND
extract(YEAR FROM sc.taken_end) <= 2009 AND

" -> Index Scan using measurement_013_stc_idx on
measurement_013 m (cost=0.00..511.00 rows=511 width=15) (actual
time=0.018..3.601 rows=3356 loops=104)"
" Index Cond: ((m.station_id = sc.station_id) AND
(m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id
= 7))"

This condition makes it slow (13 seconds on first run, 8 seconds
thereafter):

* extract(YEAR FROM sc.taken_start) >= 1900 AND
* extract(YEAR FROM sc.taken_end) <= 2009 AND

" Filter: (category_id = 7)"
" -> Seq Scan on measurement_013 m
(cost=0.00..359704.80 rows=18118464 width=15) (actual time=0.008..4025.692
rows=18118395 loops=1)"

At this point, I'm tempted to write a stored procedure that iterates over
each station category for all the years of each station. My guess is that
the planner's estimate for the number of rows that will be returned by
*extract(YEAR
FROM sc.taken_start) >= 1900* is incorrect and so it chooses a full table
scan for all rows. Even though the lower bound appears to be a constant
value of the 1900, the average year a station started collecting data was 44
years ago (1965), and did so for an average of 21.4 years.

The part I am having trouble with is convincing PG to use the index for the
station ID and the date range for when the station was active. Each station
has a unique ID; the data in the measurement table is ordered by measurement
date then by station.

Should I add a clustered index by station then by date?

Any other suggestions are very much appreciated.

Dave

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2010-05-26 04:13:42 Re: Function scan/Index scan to nested loop
Previous Message Rob Wultsch 2010-05-26 00:56:58 Re: Random Page Cost and Planner