Re: PG7.4.5: query not using index on date column

From: Dave Steinberg <dave-dated-1102349979(dot)6e4a3f(at)redterror(dot)net>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: PG7.4.5: query not using index on date column
Date: 2004-11-29 16:19:30
Message-ID: 726FC50F-4222-11D9-A28C-0030656E7E7A@redterror.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Sorry for the delay in getting back on this thread, painting my
apartment meant no internet over this thanksgiving break... :)

> How many rows in the table altogether? A rough guess is a few million
> based on the estimated seqscan cost. That would mean that this query
> is retrieving about 10% of the table, which is a large enough fraction
> that the planner will probably think a seqscan is best. It may be
> right.
> If you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE
> output change?

Right now it's a hair under 800k rows. With enable_seqscan = off, I
get this as my analyze results:


QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------
Sort (cost=361676.23..361676.33 rows=38 width=226) (actual
time=46076.756..46076.924 rows=69 loops=1)
Sort Key: count
-> Subquery Scan aggs (cost=361674.10..361675.24 rows=38
width=226) (actual time=46068.621..46076.159 rows=69 loops=1)
-> HashAggregate (cost=361674.10..361674.86 rows=38
width=54) (actual time=46068.596..46075.170 rows=69 loops=1)
-> Index Scan using received_date_idx on messages
(cost=0.00..349968.44 rows=585283 width=54) (actual
time=20.988..15020.821 rows=589543 loops=1)
Index Cond: ((received_date >= '2004-11-01'::date)
AND (received_date <= '2004-11-30'::date))
Total runtime: 46091.315 ms

Quite a bit higher with these estimates. Using your 10% of the table
guess above, I did a breakdown by date and got these results:

geekadmin(at)db:geekisp=>select received_date, count(received_date) from
spamreport.messages group by received_date order by received_date asc;
received_date | count
---------------+-------
2004-10-20 | 7592
2004-10-21 | 19699
2004-10-22 | 17311
2004-10-23 | 16730
2004-10-24 | 18249
2004-10-25 | 16718
2004-10-26 | 16951
2004-10-27 | 19818
2004-10-28 | 19580
2004-10-29 | 17610
2004-10-30 | 16210
2004-10-31 | 20468
2004-11-01 | 12337
2004-11-02 | 9012
2004-11-03 | 20871
2004-11-04 | 20103
2004-11-05 | 18807
2004-11-06 | 20131
2004-11-07 | 22291
2004-11-08 | 23041
2004-11-09 | 20254
2004-11-10 | 17810
2004-11-11 | 21091
2004-11-12 | 21976
2004-11-13 | 18824
2004-11-14 | 20543
2004-11-15 | 18829
2004-11-16 | 24248
2004-11-17 | 18093
2004-11-18 | 25675
2004-11-19 | 27084
2004-11-20 | 22362
2004-11-21 | 25187
2004-11-22 | 26451
2004-11-23 | 26016
2004-11-24 | 23147
2004-11-25 | 25785
2004-11-26 | 20584
2004-11-27 | 25615
2004-11-28 | 6931
2004-11-29 | 6549
(41 rows)

So it looks like an aggregation of 2 weeks worth of data is more than
10%, so its aligned with what you were saying. That also jives, since
when I re-enable seq_scan and do the report for one or two day's of
data - it uses the index (previously it did not - perhaps there was not
enough data).

> If it's not right, you may want to try to adjust random_page_cost
> and/or
> effective_cache_size so that the planner's estimated costs are more in
> line with reality. Beware of making such adjustments on the basis of
> only one test case, though.

I'll look in the manual and try playing with these options, thanks for
your suggestions! In the end, I'm guessing that if I need to retrieve
a few hundred thousand rows from disk, faster disks might make the
biggest impact on performance.

Thanks for your help!
--
Dave Steinberg
http://www.geekisp.com/

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message sreejith s 2004-11-29 17:01:33 Crystal Report + Bytea field
Previous Message Olivier Hubaut 2004-11-29 13:45:42 Re: INDEX and LIKE