Re: Random Page Cost and Planner

From: David Jarvis <thangalin(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Random Page Cost and Planner
Date: 2010-05-26 18:26:53
Message-ID: AANLkTimw-OvwIJ2t_L4kAB7itZFxfDogCjbC0WnPDyRG@mail.gmail.com
Lists: pgsql-performance

Hi, Kevin.

> below something in the range of 1.5 to 2 is probably not going to be
> a good choice for the mix as a whole.
>

Good to know; thanks.

> This should probably be set to something on the order of 3GB. This
> will help the optimizer make more intelligent choices about when use
> of the index will be a win.
>

I'll try this.
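
Concretely, I plan to start with something like this in postgresql.conf (the
exact values are my guesses within the ranges you suggested), reload, and
re-run EXPLAIN ANALYZE on the slow queries:

    random_page_cost = 2.0          # within the suggested 1.5 - 2 range
    effective_cache_size = 3GB      # roughly the memory available for caching

Both settings can be changed with a reload, so iterating should be quick.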

> times. You'll get your best information if you can simulate a
> more-or-less realistic load, and try that with various settings and
>

I have no idea what a realistic load will be. The system is still in
development and not open to the general public, and I don't know how much
publicity it will receive when finished. It could be a few hundred hits, or it
could be over ten thousand.

I want the system to be ready for the latter case, which means it needs to
return data for many different query parameters (date span, elevation, year,
radius, etc.) in under two seconds.
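
In the meantime, the closest I can get is probably to script a handful of
representative queries and run them concurrently with pgbench's custom-file
mode. A rough sketch, where the table, column, and database names are
placeholders for my real ones:

    -- report.sql: a stand-in for one parameterized report query
    \setrandom year 1950 2009
    SELECT count(*) FROM climate.measurement m WHERE extract(year FROM m.taken) = :year;

Then drive it with, say, ten clients for a minute:

    pgbench -n -f report.sql -c 10 -T 60 climate

That at least exercises the cache and the indexes under concurrency, even if
the query mix is a guess.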

> indexes. The cache turnover and resource contention involved in
> production can influence performance, and are hard to estimate any
> other way.
>

Another person suggested taking a look at the data.

I ran a query to see whether it makes sense to split the data by year. The
trouble is that there are 110 years and 7 categories. The data is already
filtered into child tables by category (which is logical, because reporting on
two different categories together is nonsensical -- it is meaningless to report
on snow depth *and* temperature in the same result: we already know it needs to
be cold for snow).

 count | decade start | decade end |  min date  |  max date
-------+--------------+------------+------------+------------
  3088 |         1990 |       2000 | 1990-01-01 | 2009-12-31
  2925 |         1980 |       2000 | 1980-01-01 | 2009-12-31
  2752 |         2000 |       2000 | 2000-01-01 | 2009-12-31
  2487 |         1970 |       1970 | 1970-01-01 | 1979-12-31
  2391 |         1980 |       1990 | 1980-02-01 | 1999-12-31
  2221 |         1980 |       1980 | 1980-01-01 | 1989-12-31
  1934 |         1960 |       2000 | 1960-01-01 | 2009-12-31
  1822 |         1960 |       1960 | 1960-01-01 | 1969-12-31
  1659 |         1970 |       1980 | 1970-01-01 | 1989-12-31
  1587 |         1960 |       1970 | 1960-01-01 | 1979-12-31
  1524 |         1970 |       2000 | 1970-01-01 | 2009-12-31
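
The query was along these lines (I'm paraphrasing it from memory, so the exact
expressions may differ slightly):

    SELECT count(*),
           decade_start,
           decade_end,
           min(min_taken) AS min_date,
           max(max_taken) AS max_date
      FROM (SELECT station_id,
                   (extract(decade FROM min(taken)) * 10)::int AS decade_start,
                   (extract(decade FROM max(taken)) * 10)::int AS decade_end,
                   min(taken) AS min_taken,
                   max(taken) AS max_taken
              FROM climate.measurement
             GROUP BY station_id) AS spans
     GROUP BY decade_start, decade_end
     ORDER BY count(*) DESC;

In other words, each row counts the stations whose observations begin in one
decade and end in another.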

The majority of data collected by weather stations falls between 1960 and 2009,
which makes sense: transistor technology would have made (relatively)
inexpensive automated monitoring stations possible. Or maybe there were simply
more people and more tax revenue, and thus a bigger budget for weather study.
Either way. ;-)

The point is that the top three decades (1990, 1980, 2000) have the most data,
which gives me a few options:

- Split the seven tables twice more: before 1960 and after 1960.
- Split the seven tables by decade.

The first case gives 14 tables. The second case gives 102 tables (at about 2.5M
rows per table), as there are about 17 decades in total. That still seems like
a manageable number: even if the data eventually spans 22 decades, that would
be 132 tables.
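
For the by-decade case, I picture each child looking something like this (a
simplified sketch -- the real parents would be the existing per-category child
tables, and the names here are illustrative):

    CREATE TABLE climate.measurement_temp_1980 (
        CHECK (taken >= DATE '1980-01-01' AND taken < DATE '1990-01-01')
    ) INHERITS (climate.measurement_temp);

    CREATE INDEX measurement_temp_1980_taken_idx
        ON climate.measurement_temp_1980 (taken);

    -- let the planner skip children whose CHECK constraint cannot match
    SET constraint_exclusion = on;

With constraint exclusion, a query restricted to taken BETWEEN '1980-01-01'
AND '1989-12-31' should only have to touch the 1980s child.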

Even though the users will be selecting 1900 to 2009, most of the stations
themselves will be within the 1960 - 2009 range, with the majority of those
active between 1980 and 2009.

Would splitting by decade improve the speed?

Thank you very much.

Dave
