Re: [8.1.4] Help optimizing query

From: "George Pavlov" <gpavlov(at)mynewplace(dot)com>
To: "Hayes" <aperez(at)hayesinc(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [8.1.4] Help optimizing query
Date: 2006-08-27 16:40:49
Message-ID: 8C5B026B51B6854CBE88121DBF097A863CD6DE@ehost010-33.exch010.intermedia.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Without having looked at this in detail my first suggestion would be to
do away with those date_part indices. I have found that indexes with few
distinct values usually hurt more then help and the PG optimizer is not
always smart enough to ignore them and the BitmapAnd and scan for dates
seem like a waste since you can consolidate that information from the
get-go, e.g. could you rewrite your WHERE clause to be something like:

WHERE date_trunc('quarter', entry_date) = '2006-04-01' -- for 2nd
quarter of '06

or

WHERE entry_date >= '2006-04-01'
AND entry_date < '2006-07-01'

You could try an index on either the date_trunc or the entry_date
itself, as appropriate.

I assume your user_tracking table is being inserted onto on each visit,
so you may want to be very cautious with what indexes you have on it
anyway and pare those down.

BTW, you have a redundant "WHERE subscription_id > 0" in the outer
query, not that that affects much.

From then on you may want to materialize the subselect so that all of
your five queries use it or, if possible, consolidate those five queries
into one or at least less than five. You can go even further -- seems
like this would be a good candidate for an aggregated reporting table
(essentially your subselect as a separate table updated by triggers, or
every night, or whatever). Especially since the situation will only get
worse as you have more data in your system.

George

> -----Original Message-----
> From: pgsql-performance-owner(at)postgresql(dot)org
> [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Hayes
> Sent: Saturday, August 26, 2006 1:12 PM
> To: pgsql-performance(at)postgresql(dot)org
> Subject: [PERFORM] [8.1.4] Help optimizing query
>
> Hi all,
>
> I need help making the below query go faster. There are about 5 of
> these being done
> in one report. Some use joins instead of subselects but they all go
> about the same speed.
> I'm not much of a SQL person so I just threw indices on everything
> involved but it still
> takes about at least 19sec and sometimes as much as 60s. I'd
> be happy
> to get it to about
> 5s.
>
> Other info: random_page_cost is 4 as I/O on this box as kinda slow.
> It's a converted desktop with a single IDE drive. shared_buffers is
> 12500 and effective_page_cache is 100M. I upped the statistics on
> entry_date and session_id to 1000. I analyzed the tables after
> modifying the statistics. The actual deployment platform is a lot
> beefier but I'd like these queries to at least be tolerable on this
> machine.
>
> I can see that the estimate for the GroupAggregate is off, if I'm
> interpreting things correctly, but I don't know what to do about it.
>
> tia,
> arturo
>
> Query:
>
> SELECT subscription_id,
> to_char(sum(session_length), 'HH24:MI:SS') as session_length,
> sum(hits) as hits,
> 2006 as theYear,
> 2 as theQuarter,
> sum(count) as count
> FROM (
> SELECT subscription_id,
> count(distinct session_id) as count,
> age(MAX(entry_date),MIN(entry_date)) as session_length,
> COUNT(action) as hits
> FROM
> extended_user JOIN user_tracking USING (user_id)
> WHERE subscription_id > 0 AND
> EXTRACT(year from entry_date) = 2006 AND
> EXTRACT(quarter from entry_date) = 2
> GROUP BY session_id,
> subscription_id
> ) as session_stuff
> WHERE subscription_id > 0
> GROUP BY subscription_id
> ORDER BY subscription_id;
>
>
> Sort (cost=123305.88..123306.38 rows=200 width=36) (actual
> time=75039.706..75040.500 rows=258 loops=1)
> Sort Key: session_stuff.subscription_id
> -> HashAggregate (cost=123294.24..123298.24 rows=200 width=36)
> (actual time=75036.487..75038.360 rows=258 loops=1)
> -> GroupAggregate (cost=108839.34..118475.94 rows=240915
> width=72) (actual time=68016.583..74702.710 rows=38369 loops=1)
> -> Sort (cost=108839.34..109441.63 rows=240915
> width=72) (actual time=67978.193..68982.962 rows=245727 loops=1)
> Sort Key: user_tracking.session_id,
> extended_user.subscription_id
> -> Hash Join (cost=7746.59..75492.37
> rows=240915
> width=72) (actual time=16944.487..50737.230 rows=245727 loops=1)
> Hash Cond: ("outer".user_id =
> "inner".user_id)
> -> Bitmap Heap Scan on user_tracking
> (cost=7524.10..68644.10 rows=240950 width=72) (actual
> time=16843.695..48306.383 rows=258923 loops=1)
> Recheck Cond:
> ((date_part('quarter'::text, entry_date) = 2::double precision) AND
> (date_part('year'::text, entry_date) = 2006::double precision))
> -> BitmapAnd
> (cost=7524.10..7524.10
> rows=240950 width=0) (actual time=16779.178..16779.178 rows=0 loops=1)
> -> Bitmap Index Scan on
> user_tracking_quarter_idx (cost=0.00..3331.51 rows=533288 width=0)
> (actual time=9079.545..9079.545 rows=533492 loops=1)
> Index Cond:
> (date_part('quarter'::text, entry_date) = 2::double precision)
> -> Bitmap Index Scan on
> user_tracking_year_idx (cost=0.00..4192.34 rows=671239
> width=0) (actual
> time=7685.906..7685.906 rows=671787 loops=1)
> Index Cond:
> (date_part('year'::text, entry_date) = 2006::double precision)
> -> Hash (cost=206.42..206.42 rows=6428
> width=8) (actual time=100.754..100.754 rows=6411 loops=1)
> -> Seq Scan on extended_user
> (cost=0.00..206.42 rows=6428 width=8) (actual time=0.020..28.873
> rows=6411 loops=1)
> Filter: ((subscription_id > 0)
> AND (subscription_id > 0))
> Total runtime: 75069.453 ms
>
> Tables:
>
> This one has about 6-7k rows.
>
> Table "public.extended_user"
> Column | Type | Modifiers
> -------------------+--------------------------+-----------
> create_date | timestamp with time zone | not null
> email | character varying(99) |
> first_name | character varying(99) | not null
> last_name | character varying(99) | not null
> license_agreement | boolean | not null
> license_date | timestamp with time zone |
> password | character varying(32) | not null
> subscription_id | integer | not null
> user_id | integer | not null
> user_name | character varying(99) | not null
> Indexes:
> "extended_user_pkey" PRIMARY KEY, btree (user_id)
> "extended_user_subscription_id_idx" btree (subscription_id)
> "extended_user_subscription_idx" btree (subscription_id)
> Foreign-key constraints:
> "extended_user_subscription_id_fkey" FOREIGN KEY
> (subscription_id)
> REFERENCES subscription(subscription_id) DEFERRABLE INITIALLY DEFERRED
>
> This one has about 2k rows.
>
> Table "public.subscription"
> Column | Type | Modifiers
> ------------------+--------------------------+-----------
> allow_printing | boolean | not null
> company_id | character varying(50) | not null
> company_name | character varying(100) | not null
> end_date | timestamp with time zone |
> licenses | integer | not null
> pass_through_key | character varying(50) |
> start_date | timestamp with time zone | not null
> subscription_id | integer | not null
> Indexes:
> "subscription_pkey" PRIMARY KEY, btree (subscription_id)
>
>
> This one has about 1.4M rows. It's kind of a log of pages visited.
>
> Table
> "public.user_tracking"
> Column | Type |
>
> Modifiers
> ------------------+-----------------------------+-------------
> -----------
> --------------------------------------------------
> action | character varying(255) | not null
> entry_date | timestamp without time zone | not null
> note | text |
> report_id | integer |
> session_id | character varying(255) | not null
> user_id | integer |
> user_tracking_id | integer | not null default
> nextval('user_tracking_user_tracking_id_seq'::regclass)
> Indexes:
> "user_tracking_pkey" PRIMARY KEY, btree (user_tracking_id)
> "user_tracking_entry_date_idx" btree (entry_date)
> "user_tracking_month_idx" btree (date_part('month'::text,
> entry_date))
> "user_tracking_quarter_idx" btree (date_part('quarter'::text,
> entry_date))
> "user_tracking_report_id_idx" btree (report_id)
> "user_tracking_session_idx" btree (session_id)
> "user_tracking_user_id_idx" btree (user_id)
> "user_tracking_year_idx" btree (date_part('year'::text,
> entry_date))
> Foreign-key constraints:
> "user_tracking_report_id_fkey" FOREIGN KEY (report_id) REFERENCES
> article(article_id) DEFERRABLE INITIALLY DEFERRED
> "user_tracking_user_id_fkey" FOREIGN KEY (user_id) REFERENCES
> extended_user(user_id) DEFERRABLE INITIALLY DEFERRED
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2006-08-28 02:26:12 Re: Benchmarks
Previous Message Alvaro Herrera 2006-08-27 14:37:54 Re: stats reset during pg_restore?