Re: optimizing selects on time-series data in Pg

From: Aditya <aditya(at)grot(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: sfpug(at)postgresql(dot)org
Subject: Re: optimizing selects on time-series data in Pg
Date: 2003-08-01 17:13:39
Message-ID: 20030801171339.GA28448@mighty.grot.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: sfpug

On Fri, Aug 01, 2003 at 09:58:24AM -0700, Josh Berkus wrote:
> Can you post your "slow queries"?

Here's what I did:

- I did a vacuum

- did a select count(*) for a few different virtualhosts to see how many rows
were being dealt with

- run explain analyze on each of those select count(*) and noted the
difference, ie. the planner decided that a select on a virtual host with many
more rows was better served by sequential scan than index scan

The typical query is something like show me all the rows for a virtualhost in
the last two days with the latest first, ie.

select
*
from zp_log
where
timestamp > (now() - interval '2 days')
and virtualhost = 'www.bloki.com';

- dropped the virtualhost index and recreated it, verified the behaviour was
the same as above

- forced the planner not to use sequential scan and verified that the index
scan was in fact more costly then, here's a somewhat edited transcript (only
for brevity):

zp1139=> vacuum full verbose analyze zp_log;
NOTICE: --Relation zp_log--
NOTICE: Pages 67455: Changed 2444, reaped 26393, Empty 0, New 0; Tup 1043375:
Vac 11184, Keep/VTL 0/0, UnUsed 660823, MinLen 141, MaxLen 690; Re-using:
Free/Avail. Space 213885416/212438124; EndEmpty/Avail. Pages 0/46959.
CPU 6.50s/1.43u sec elapsed 80.16 sec.
NOTICE: Index zp_log_vhost: Pages 10620; Tuples 1043375: Deleted 11184.
CPU 2.26s/0.64u sec elapsed 95.34 sec.
NOTICE: Index zp_log_timestamp: Pages 7199; Tuples 1043375: Deleted 11184.
CPU 0.61s/0.57u sec elapsed 10.01 sec.
NOTICE: Rel zp_log: Pages: 67455 --> 41499; Tuple(s) moved: 646374.
CPU 47.37s/187.31u sec elapsed 1826.09 sec.
NOTICE: Index zp_log_vhost: Pages 14225; Tuples 1043375: Deleted 646374.
CPU 4.47s/7.12u sec elapsed 161.17 sec.
NOTICE: Index zp_log_timestamp: Pages 9667; Tuples 1043375: Deleted 646374.
CPU 2.64s/9.53u sec elapsed 79.01 sec.
NOTICE: --Relation pg_toast_492740--
NOTICE: Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL
0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0;
EndEmpty/Avail. Pages 0/0.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
NOTICE: Index pg_toast_492740_idx: Pages 1; Tuples 0.
CPU 0.00s/0.00u sec elapsed 0.03 sec.
NOTICE: Analyzing zp_log
VACUUM
zp1139=> select count(*) from zp_log ;
count
---------
1043375
(1 row)

zp1139=> select count(*) from zp_log where virtualhost = 'aditya.bloki.org';
count
-------
5012
(1 row)

zp1139=> select count(*) from zp_log where virtualhost = 'www.bloki.com';
count
--------
303007
(1 row)

zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'aditya.bloki.org';
NOTICE: QUERY PLAN:

Aggregate (cost=5017.50..5017.50 rows=1 width=0) (actual time=3017.90..3017.90
rows=1 loops=1)
-> Index Scan using zp_log_vhost on zp_log (cost=0.00..5014.24 rows=1302
width=0) (actual time=67.94..2990.22 rows=5012 loops=1)
Total runtime: 3018.50 msec

EXPLAIN
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'www.bloki.com';
NOTICE: QUERY PLAN:

Aggregate (cost=55289.81..55289.81 rows=1 width=0) (actual
time=36120.98..36120.98 rows=1 loops=1)
-> Seq Scan on zp_log (cost=0.00..54541.19 rows=299449 width=0) (actual
time=21.60..34645.53 rows=303007 loops=1)
Total runtime: 36121.51 msec

EXPLAIN
zp1139=> \d zp_log
Table "zp_log"
Column | Type | Modifiers
-------------+--------------------------+-----------
priority | character varying(15) |
category | character varying(30) |
thread | character varying(30) |
message | character varying(255) |
timestamp | timestamp with time zone |
remoteip | text | not null
remotehost | text |
referer | text |
status | integer | not null
port | integer | not null
virtualhost | text | not null
query | text |
bytes | integer | not null
useragent | text |
Indexes: zp_log_timestamp,
zp_log_vhost

zp1139=> \d zp_log_vhost
Index "zp_log_vhost"
Column | Type
-------------+------
virtualhost | text
btree

zp1139=> \d zp_log_timestamp
Index "zp_log_timestamp"
Column | Type
-----------+--------------------------
timestamp | timestamp with time zone
btree

zp1139=> drop index zp_log_vhost;
DROP
zp1139=> create index zp_log_vhost on zp_log (virtualhost);
CREATE
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'bloki.com';
NOTICE: QUERY PLAN:

Aggregate (cost=5006.50..5006.50 rows=1 width=0) (actual time=1791.87..1791.87
rows=1 loops=1)
-> Index Scan using zp_log_vhost on zp_log (cost=0.00..5003.24 rows=1302
width=0) (actual time=97.89..1786.38 rows=592 loops=1)
Total runtime: 1792.42 msec

EXPLAIN
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'www.bloki.com';
NOTICE: QUERY PLAN:

Aggregate (cost=55289.81..55289.81 rows=1 width=0) (actual
time=35164.32..35164.33 rows=1 loops=1)
-> Seq Scan on zp_log (cost=0.00..54541.19 rows=299449 width=0) (actual
time=23.90..33726.02 rows=303007 loops=1)
Total runtime: 35164.48 msec

EXPLAIN
zp1139=> set ENABLE_SEQSCAN to off;
SET VARIABLE
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'www.bloki.com';
NOTICE: QUERY PLAN:

Aggregate (cost=1140051.72..1140051.72 rows=1 width=0) (actual
time=97158.42..97158.42 rows=1 loops=1)
-> Index Scan using zp_log_vhost on zp_log (cost=0.00..1139303.10
rows=299449 width=0) (actual time=83.77..96223.63 rows=303007 loops=1)
Total runtime: 97158.93 msec

EXPLAIN
zp1139=> set ENABLE_SEQSCAN to on;
SET VARIABLE

-- here we try using both indices and see if that makes a difference

zp1139=> explain analyze select count(*) from zp_log where timestamp > (now()
- interval '2 days') and virtualhost ='aditya.bloki.org';
NOTICE: QUERY PLAN:

Aggregate (cost=3795.19..3795.19 rows=1 width=0) (actual time=4185.29..4185.29
rows=1 loops=1)
-> Index Scan using zp_log_vhost on zp_log (cost=0.00..3794.37 rows=327
width=0) (actual time=79.67..4180.29 rows=1564 loops=1)
Total runtime: 4185.87 msec

EXPLAIN
zp1139=> explain analyze select count(*) from zp_log where timestamp > (now()
- interval '2 days') and virtualhost = 'www.bloki.com';
NOTICE: QUERY PLAN:

Aggregate (cost=62616.04..62616.04 rows=1 width=0) (actual
time=28471.34..28471.34 rows=1 loops=1)
-> Seq Scan on zp_log (cost=0.00..62366.50 rows=99816 width=0) (actual
time=0.18..28439.29 rows=11417 loops=1)
Total runtime: 28471.81 msec

EXPLAIN

In response to

Responses

Browse sfpug by date

  From Date Subject
Next Message Josh Berkus 2003-08-01 17:33:42 Re: optimizing selects on time-series data in Pg
Previous Message Josh Berkus 2003-08-01 16:58:24 Re: optimizing selects on times-eries data in Pg