> And this is what happens in the queries above - the first query covers
>> years 1963-2009, while the second one covers 1900-2009. Given the fact
>> this table contains ~40m rows, the first query returns about 0.01% (3k
>> rows) while the second one returns almost 50% of the data (18m rows). So
>> doubt this might be improved using an index ...
> I don't think that's what I'm doing.
> There are two tables involved: station_category (sc) and measurement (m).
> The first part of the query:
> extract(YEAR FROM sc.taken_start) >= 1900 AND
> extract(YEAR FROM sc.taken_end) <= 2009 AND
> That is producing a limit on the station_category table. There are, as far
> as I can tell, no stations that have been taking weather readings for 110
> years. Most of them have a lifespan of 24 years. The above condition just
> makes sure that I don't get data before 1900 or after 2009.
OK, I admit I'm a little bit condfused by the query, especially by these
sc.taken_start >= '1900-01-01'::date AND
sc.taken_end <= '1996-12-31'::date AND
m.taken BETWEEN sc.taken_start AND sc.taken_end AND
Which seems to me a little bit "convoluted". Well, I think I understand
what that means - give me all stations for a given city, collecting the
category of data at a certain time. But I'm afraid this makes the planning
much more difficult, as the select from measurements depend on the data
returned by other parts of the query (rows from category).
See this http://explain.depesz.com/s/H1 and this
I guess the planner is confused in the second case - believes it has to
read a lot more data from the measurement table, and so chooses the
sequential scan. The question is if this is the right decision (I believe
it is not).
How many rows does the query return without the group by clause? About
140000 in both cases, right?
>> by time (each year a separate partition) but you'll have to investigate
>> that on your own (depends on your use-cases).
> I cannot partition by time. First, there are 7 categories, which would
> 770 partitions if I did it by year -- 345000 rows per partition. This will
> grow in the future. I have heard there are troubles with having lots of
> child tables (too many files for the operating system). Second, the user
> the ability to pick arbitrary day ranges for arbitrary year spans.
> There's a "year wrapping" issue that I won't explain because I never get
> right the first time. ;-)
OK, I haven't noticed the table is already partitioned by category_id and
I didn't mean to partition by (taken, category_id) - that would produce a
lot of partitions. Yes, that might cause problems related to number of
files, but that's rather a filesystem related issue.
I'd expect rather issues related to RULEs or triggers (not sure which of
them you use to redirect the data into partitions). But when partitioning
by time (and not by category_id) the number of partitions will be much
lower and you don't have to keep all of the rules active - all you need is
a rule for the current year (and maybe the next one).
I'm not sure what you mean by 'year wrapping issue' but I think it might
work quite well - right not the problem is PostgreSQL decides to scan the
whole partition (all data for a given category_id).
In response to
pgsql-performance by date
|Next:||From: David Jarvis||Date: 2010-05-26 18:26:53|
|Subject: Re: Random Page Cost and Planner|
|Previous:||From: Craig James||Date: 2010-05-26 17:09:41|
|Subject: Re: PostgreSQL Function Language Performance: C vs PL/PGSQL|