Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis <thangalin(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thombrown(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimize date query for large child tables: GiST or GIN?
Date: 2010-05-22 17:54:14
Message-ID: AANLkTil-5tkGrz6cfk-X6ddwaibMMjd5KnVqvxEOuOkV@mail.gmail.com
Lists: pgsql-performance

Hi,

> certainly understand that you wouldn't want to partition by year. It
>

Definitely not.

> does strike me that perhaps you could partition by day ranges, but you'd
>

I don't think that will work; users can choose any day range, the most common
being Jan 1 - Dec 31, followed by seasonal ranges, then arbitrary ranges.

> some of this data.. If users are going to typically use 1900-2009 for
> years, then could the information about all of those years be aggregated
> a priori to make those queries faster?
>

I'm not sure what you mean. I could create a separate table that lumps the
aggregated averages per year per station per category, but that will only
help in the one case. There are five different reporting pages (Basic
through Guru). On three of those pages the user must select arbitrary day
ranges. On one of those pages, the user can select a season, which then maps
to, for all intents and purposes, an arbitrary day range.
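
For the one case where it would help, that summary table might look roughly
like this (hypothetical names, just for illustration):

CREATE TABLE climate.measurement_yearly_summary AS
SELECT
  station_id,
  category_id,
  extract(YEAR FROM taken) AS year_taken,
  avg(amount) AS avg_amount
FROM
  climate.measurement
GROUP BY
  station_id,
  category_id,
  extract(YEAR FROM taken);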

Only the most basic page does not offer the user a day range selection.

> Do not get hung up on having to have a separate table for every unique
> value in the column- you don't need that. constraint_exclusion will
>

That's good advice. I have repartitioned the data into seven tables: one per
category.
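
For the archives, the child tables follow the usual inheritance pattern,
roughly like this (the real names differ):

-- One child table per category; the CHECK constraint is what lets
-- constraint_exclusion skip this table when a query filters on a
-- different category_id.
CREATE TABLE climate.measurement_007 (
  CHECK (category_id = 7)
) INHERITS (climate.measurement);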

>> I agree with Matthew Wakeling in a different post: it's probably wise to
> I would agree with this too- get it working first, then look at
> partitioning. Even more so- work on a smaller data set to begin with
>

The query speed has now improved considerably, thanks to everybody's advice:
the estimated cost dropped from 10006220141 down to 704924. Here is the query:

SELECT
  avg(m.amount),
  extract(YEAR FROM m.taken) AS year_taken
FROM
  climate.city c,
  climate.station s,
  climate.measurement m
WHERE
  c.id = 5182 AND
  6371.009 * SQRT(
    POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
    (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
     POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
  ) <= 25 AND
  s.elevation BETWEEN 0 AND 3000 AND
  m.category_id = 7 AND
  m.station_id = s.id AND
  extract(YEAR FROM m.taken) BETWEEN 1900 AND 2000
GROUP BY
  extract(YEAR FROM m.taken)
ORDER BY
  extract(YEAR FROM m.taken)

(Note that *extract(YEAR FROM m.taken)* is much faster than
*date_part('year'::text, m.taken)*.)
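
The year predicate is matched by an expression index on each child table,
defined along these lines (a sketch; the exact definition may differ
slightly):

-- Expression index on the year of the measurement date; it is what the
-- "Index Cond: (date_part('year'::text, ...))" in the plan below matches.
CREATE INDEX measurement_013_taken_year_idx
  ON climate.measurement_013 ((extract(YEAR FROM taken)));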

The query plan for the above SQL reveals:

"Sort (cost=704924.25..704924.75 rows=200 width=9) (actual
time=9476.518..9476.521 rows=46 loops=1)"
" Sort Key: (date_part('year'::text, (m.taken)::timestamp without time
zone))"
" Sort Method: quicksort Memory: 28kB"
" -> HashAggregate (cost=704913.10..704916.60 rows=200 width=9) (actual
time=9476.465..9476.489 rows=46 loops=1)"
" -> Hash Join (cost=1043.52..679956.79 rows=4991262 width=9)
(actual time=46.399..9344.537 rows=120678 loops=1)"
" Hash Cond: (m.station_id = s.id)"
" -> Append (cost=0.00..529175.42 rows=14973786 width=13)
(actual time=0.076..7739.647 rows=14874909 loops=1)"
" -> Seq Scan on measurement m (cost=0.00..43.00 rows=1
width=20) (actual time=0.000..0.000 rows=0 loops=1)"
" Filter: ((category_id = 7) AND
(date_part('year'::text, (taken)::timestamp without time zone) >=
1900::double precision) AND (date_part('year'::text, (taken)::timestamp
without time zone) <= 2000::double precision))"
" -> Index Scan using measurement_013_taken_year_idx on
measurement_013 m (cost=0.01..529132.42 rows=14973785 width=13) (actual
time=0.075..6266.385 rows=14874909 loops=1)"
" Index Cond: ((date_part('year'::text,
(taken)::timestamp without time zone) >= 1900::double precision) AND
(date_part('year'::text, (taken)::timestamp without time zone) <=
2000::double precision))"
" Filter: (category_id = 7)"
" -> Hash (cost=992.94..992.94 rows=4046 width=4) (actual
time=43.420..43.420 rows=78 loops=1)"
" -> Nested Loop (cost=0.00..992.94 rows=4046 width=4)
(actual time=0.053..43.390 rows=78 loops=1)"
" Join Filter: ((6371.009::double precision *
sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double
precision), 2::double precision) + (cos((radians(((c.latitude_decimal +
s.latitude_decimal))::double precision) / 2::double precision)) *
pow(radians(((c.longitude_decimal - s.longitude_decimal))::double
precision), 2::double precision))))) <= 25::double precision)"
" -> Index Scan using city_pkey1 on city c
(cost=0.00..4.27 rows=1 width=16) (actual time=0.021..0.022 rows=1 loops=1)"
" Index Cond: (id = 5182)"
" -> Seq Scan on station s (cost=0.00..321.08
rows=12138 width=20) (actual time=0.008..5.457 rows=12139 loops=1)"
" Filter: ((s.elevation >= 0) AND
(s.elevation <= 3000))"
"Total runtime: 9476.626 ms"

That's about 10 seconds using the category with the smallest table. The
largest table takes 17 seconds (fantastic!) after a few runs and 85 seconds
cold. My goal is about 1 second, before the pending hardware upgrades.

When I recreated the tables, I sorted the data by date and then by station id,
so that there is now a 1:1 correlation between the sequence number and the
measurement date.

Would clustering on the date and station make a difference?
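
To be concrete, I mean something along these lines for each child table (the
index name and column order are hypothetical):

-- Hypothetical composite index to cluster on; CLUSTER rewrites the table
-- in index order so that date-range scans touch fewer pages.
CREATE INDEX measurement_013_taken_station_idx
  ON climate.measurement_013 (taken, station_id);
CLUSTER climate.measurement_013 USING measurement_013_taken_station_idx;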

Is there any other way to index the data that I have missed?

Thank you very much.

Dave
