Re: Optimize date query for large child tables: GiST or GIN?

From: Thom Brown <thombrown(at)gmail(dot)com>
To: David Jarvis <thangalin(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimize date query for large child tables: GiST or GIN?
Date: 2010-05-20 08:33:15
Message-ID: AANLkTilud7Qc25XZDziBMfzRnn8XCaNCo2jKHtW2-vdF@mail.gmail.com
Lists: pgsql-performance

On 20 May 2010 06:06, David Jarvis <thangalin(at)gmail(dot)com> wrote:
> Hi,
>
> I recently switched from MySQL to PostgreSQL so that I can use PL/R for data
> analysis. The MySQL form of the query (against a more complex table structure)
> takes ~5 seconds to run. I have yet to let the PostgreSQL version finish: it
> runs for over a minute before I cancel it. I think I have the correct table
> structure in place (it is much simpler than the former MySQL structure);
> however, the query performs a full table scan against the parent table's 273
> million rows.
>
> Questions
>
> What is the proper way to index the dates to avoid full table scans?
>
> Options I have considered:
>
> GIN
> GiST
> Rewrite the WHERE clause
> Separate year_taken, month_taken, and day_taken columns to the tables
>
> Details
>
> The HashAggregate in the plan shows a cost of 10006220141.11, which is, I
> suspect, astronomically huge. A full table scan is performed on the
> measurement table (which itself holds no data and has no indexes); it
> aggregates 237 million rows from its child tables. The sluggishness comes
> from this part of the query:
>
>       m.taken BETWEEN
>         /* Start date. */
>       (extract( YEAR FROM m.taken )||'-01-01')::date AND
>         /* End date. Calculated by checking to see if the end date wraps
>           into the next year. If it does, then add 1 to the current year.
>         */
>         (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
>           sign(
>             (extract( YEAR FROM m.taken )||'-12-31')::date -
>             (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
>         ) AS text)||'-12-31')::date
>
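Because both BETWEEN bounds are derived from m.taken itself, the planner has no
constant range to compare against an index on taken. For reference, a minimal
sketch (not the original query) of the sargable form this reduces to when the
selected day range covers the whole year, with the bounds supplied as constants
by the application:

      m.taken BETWEEN
        /* Start date, passed in as a constant. */
        DATE '1900-01-01' AND
        /* End date, passed in as a constant. */
        DATE '2009-12-31'
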
> There are 72 child tables, each having a year index and a station index,
> which are defined as follows:
>
>     CREATE TABLE climate.measurement_12_013 (
>     -- Inherited from table climate.measurement_12_013:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
>     -- Inherited from table climate.measurement_12_013:  station_id integer NOT NULL,
>     -- Inherited from table climate.measurement_12_013:  taken date NOT NULL,
>     -- Inherited from table climate.measurement_12_013:  amount numeric(8,2) NOT NULL,
>     -- Inherited from table climate.measurement_12_013:  category_id smallint NOT NULL,
>     -- Inherited from table climate.measurement_12_013:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
>       CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
>       CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
>     )
>     INHERITS (climate.measurement)
>
>     CREATE INDEX measurement_12_013_s_idx
>       ON climate.measurement_12_013
>       USING btree
>       (station_id);
>     CREATE INDEX measurement_12_013_y_idx
>       ON climate.measurement_12_013
>       USING btree
>       (date_part('year'::text, taken));
>
> (Foreign key constraints to be added later.)
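
An expression index on date_part('year'::text, taken) is only considered for
predicates written with that exact expression; a range predicate on the raw
column needs an index on taken itself. A sketch of what that could look like
(the index names below are illustrative, not part of the schema above):

    -- Hypothetical: plain btree on the raw date column, usable for
    -- predicates such as taken BETWEEN DATE '1900-01-01' AND DATE '2009-12-31'.
    CREATE INDEX measurement_12_013_taken_idx
      ON climate.measurement_12_013
      USING btree
      (taken);

    -- Hypothetical: composite index serving the station filter and the
    -- date range in a single scan.
    CREATE INDEX measurement_12_013_s_taken_idx
      ON climate.measurement_12_013
      USING btree
      (station_id, taken);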
>
> The following query runs abysmally slow due to a full table scan:
>
>     SELECT
>       count(1) AS measurements,
>       avg(m.amount) AS amount
>     FROM
>       climate.measurement m
>     WHERE
>       m.station_id IN (
>         SELECT
>           s.id
>         FROM
>           climate.station s,
>           climate.city c
>         WHERE
>             /* For one city... */
>             c.id = 5182 AND
>
>             /* Where stations are within an elevation range... */
>             s.elevation BETWEEN 0 AND 3000 AND
>
>             /* and within a specific radius... */
>             6371.009 * SQRT(
>               POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
>                 (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
>             POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
>             ) <= 50
>         ) AND
>
>       /* Data before 1900 is shaky; insufficient after 2009. */
>       extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND
>
>       /* Whittled down by category... */
>       m.category_id = 1 AND
>
>       /* Between the selected days and years... */
>       m.taken BETWEEN
>        /* Start date. */
>        (extract( YEAR FROM m.taken )||'-01-01')::date AND
>         /* End date. Calculated by checking to see if the end date wraps
>            into the next year. If it does, then add 1 to the current year.
>         */
>         (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
>           sign(
>             (extract( YEAR FROM m.taken )||'-12-31')::date -
>             (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
>         ) AS text)||'-12-31')::date
>     GROUP BY
>       extract( YEAR FROM m.taken )
>
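Since each child table carries CHECK constraints on category_id and on the
month of taken, the planner can skip non-matching partitions only when
constraint exclusion is enabled and the query compares those columns against
constants rather than expressions computed from each row. A minimal sketch,
assuming PostgreSQL 8.4 or later:

    -- 'partition' limits the check to inheritance/UNION ALL plans;
    -- earlier releases only offer 'on' or 'off'.
    SET constraint_exclusion = partition;
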
> What are your thoughts?
>
> Thank you!
>
>

Could you provide the EXPLAIN output for that slow query?
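
Plain EXPLAIN shows the estimated plan without running the query; EXPLAIN
ANALYZE executes it, so it will take as long as the query itself. Something
along these lines, prefixed to the full SELECT you posted:

    EXPLAIN ANALYZE
    SELECT
      count(1) AS measurements,
      avg(m.amount) AS amount
    FROM
      climate.measurement m
    WHERE
      ...  -- the same predicates as in the original query
    GROUP BY
      extract( YEAR FROM m.taken );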

Thom
