Re: Random Page Cost and Planner

From: Alexey Klyukin <alexk(at)commandprompt(dot)com>
To: David Jarvis <thangalin(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Random Page Cost and Planner
Date: 2010-05-26 09:00:18
Message-ID: 1B94E0EC-D665-4059-83DC-5D1B1E2B1009@commandprompt.com
Lists: pgsql-performance


On May 26, 2010, at 6:50 AM, David Jarvis wrote:
>
> That said, when using the following condition, the query is fast (1 second):
>
> extract(YEAR FROM sc.taken_start) >= 1963 AND
> extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> " -> Index Scan using measurement_013_stc_idx on measurement_013 m (cost=0.00..511.00 rows=511 width=15) (actual time=0.018..3.601 rows=3356 loops=104)"
> " Index Cond: ((m.station_id = sc.station_id) AND (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id = 7))"
>
> This condition makes it slow (13 seconds on first run, 8 seconds thereafter):
>
> extract(YEAR FROM sc.taken_start) >= 1900 AND
> extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> " Filter: (category_id = 7)"
> " -> Seq Scan on measurement_013 m (cost=0.00..359704.80 rows=18118464 width=15) (actual time=0.008..4025.692 rows=18118395 loops=1)"
>
> At this point, I'm tempted to write a stored procedure that iterates over each station category for all the years of each station. My guess is that the planner's estimate for the number of rows that will be returned by extract(YEAR FROM sc.taken_start) >= 1900 is incorrect and so it chooses a full table scan for all rows.

Nope, it appears that the planner's estimate is correct (18118464 estimated vs 18118395 actual rows). I think what's happening there is that those 18M rows are a large enough fraction of the total table that it makes sense to scan it sequentially (avoiding random access costs). Try SET enable_seqscan = off and repeat the query - there is a chance that the index scan would be even slower.
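As a minimal sketch of that experiment (the join to sc and the column names are assumed from the plans quoted above, since the full query is not shown in this message):

```sql
-- Disable sequential scans for this session only, to see which plan
-- the planner falls back to and whether it is actually faster.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT m.*
FROM measurement_013 m
JOIN station_category sc          -- 'station_category' is an assumed table name
  ON m.station_id = sc.station_id
WHERE m.taken BETWEEN sc.taken_start AND sc.taken_end
  AND m.category_id = 7;

-- Restore the default so later queries are planned normally.
RESET enable_seqscan;
```

Note that enable_seqscan = off does not forbid sequential scans outright; it only makes them look very expensive to the planner, so it is a diagnostic tool rather than a production setting.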

> The part I am having trouble with is convincing PG to use the index for the station ID and the date range for when the station was active. Each station has a unique ID; the data in the measurement table is ordered by measurement date then by station.
>
> Should I add a clustered index by station then by date?
>
> Any other suggestions are very much appreciated.

Is it necessary to fetch the data all the way back to 1900 every time? Maybe there is a possibility to pre-aggregate the results from past years, since they no longer change.
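One possible shape for such a pre-aggregation, sketched with illustrative names (the summary table and the 'amount' column are assumptions, not taken from the original thread):

```sql
-- Hypothetical summary table holding per-station, per-category yearly
-- aggregates for historical years, which are immutable and thus safe to cache.
CREATE TABLE measurement_yearly_summary AS
SELECT station_id,
       category_id,
       extract(YEAR FROM taken) AS year,
       avg(amount) AS avg_amount,   -- 'amount' is an assumed measurement column
       count(*)    AS n
FROM measurement_013
WHERE extract(YEAR FROM taken) < 2009
GROUP BY station_id, category_id, extract(YEAR FROM taken);

-- Queries over long historical ranges can then read this small table
-- and only touch measurement_013 for the current, still-changing year.
```

The summary would need to be refreshed (or appended to) once a year as another year becomes final.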

Regards,
--
Alexey Klyukin <alexk(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
