Skip site navigation (1) Skip section navigation (2)

Re: optimizing selects on time-series data in Pg

From: Aditya <aditya(at)grot(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: sfpug(at)postgresql(dot)org
Subject: Re: optimizing selects on time-series data in Pg
Date: 2003-08-01 17:13:39
Message-ID: 20030801171339.GA28448@mighty.grot.org (view raw or flat)
Thread:
Lists: sfpug
On Fri, Aug 01, 2003 at 09:58:24AM -0700, Josh Berkus wrote:
> Can you post your "slow queries"?

Here's what I did:

- I did a vacuum

- did a select count(*) for a few different virtualhosts to see how many rows
were being dealt with

- run explain analyze on each of those select count(*) and noted the
difference, ie. the planner decided that a select on a virtual host with many
more rows was better served by sequential scan than index scan

The typical query is something like show me all the rows for a virtualhost in
the last two days with the latest first, ie.

select 
 * 
from zp_log 
where
 timestamp > (now() - interval '2 days')
 and virtualhost = 'www.bloki.com';

- dropped the virtualhost index and recreated it, verified the behaviour was
the same as above

- forced the planner not to use sequential scan and verified that the index
scan was in fact more costly then, here's a somewhat edited transcript (only
for brevity):

zp1139=> vacuum full verbose analyze zp_log;
NOTICE:  --Relation zp_log--
NOTICE:  Pages 67455: Changed 2444, reaped 26393, Empty 0, New 0; Tup 1043375:
Vac 11184, Keep/VTL 0/0, UnUsed 660823, MinLen 141, MaxLen 690; Re-using:
Free/Avail. Space 213885416/212438124; EndEmpty/Avail. Pages 0/46959.
        CPU 6.50s/1.43u sec elapsed 80.16 sec.
NOTICE:  Index zp_log_vhost: Pages 10620; Tuples 1043375: Deleted 11184.
        CPU 2.26s/0.64u sec elapsed 95.34 sec.
NOTICE:  Index zp_log_timestamp: Pages 7199; Tuples 1043375: Deleted 11184.
        CPU 0.61s/0.57u sec elapsed 10.01 sec.
NOTICE:  Rel zp_log: Pages: 67455 --> 41499; Tuple(s) moved: 646374.
        CPU 47.37s/187.31u sec elapsed 1826.09 sec.
NOTICE:  Index zp_log_vhost: Pages 14225; Tuples 1043375: Deleted 646374.
        CPU 4.47s/7.12u sec elapsed 161.17 sec.
NOTICE:  Index zp_log_timestamp: Pages 9667; Tuples 1043375: Deleted 646374.
        CPU 2.64s/9.53u sec elapsed 79.01 sec.
NOTICE:  --Relation pg_toast_492740--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0, Keep/VTL
0/0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail. Space 0/0;
EndEmpty/Avail. Pages 0/0.
        CPU 0.00s/0.00u sec elapsed 0.00 sec.
NOTICE:  Index pg_toast_492740_idx: Pages 1; Tuples 0.
        CPU 0.00s/0.00u sec elapsed 0.03 sec.
NOTICE:  Analyzing zp_log
VACUUM
zp1139=> select count(*) from zp_log ;
  count
---------
 1043375
(1 row)

zp1139=> select count(*) from zp_log where virtualhost = 'aditya.bloki.org';
 count
-------
  5012
(1 row)

zp1139=> select count(*) from zp_log where virtualhost = 'www.bloki.com';
 count
--------
 303007
(1 row)

zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'aditya.bloki.org';
NOTICE:  QUERY PLAN:

Aggregate (cost=5017.50..5017.50 rows=1 width=0) (actual time=3017.90..3017.90
rows=1 loops=1)
  -> Index Scan using zp_log_vhost on zp_log (cost=0.00..5014.24 rows=1302
width=0) (actual time=67.94..2990.22 rows=5012 loops=1)
Total runtime: 3018.50 msec

EXPLAIN
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'www.bloki.com';
NOTICE:  QUERY PLAN:

Aggregate (cost=55289.81..55289.81 rows=1 width=0) (actual
time=36120.98..36120.98 rows=1 loops=1)
  -> Seq Scan on zp_log (cost=0.00..54541.19 rows=299449 width=0) (actual
time=21.60..34645.53 rows=303007 loops=1)
Total runtime: 36121.51 msec

EXPLAIN
zp1139=> \d zp_log
                   Table "zp_log"
   Column    |           Type           | Modifiers
-------------+--------------------------+-----------
 priority    | character varying(15)    |
 category    | character varying(30)    |
 thread      | character varying(30)    |
 message     | character varying(255)   |
 timestamp   | timestamp with time zone |
 remoteip    | text                     | not null
 remotehost  | text                     |
 referer     | text                     |
 status      | integer                  | not null
 port        | integer                  | not null
 virtualhost | text                     | not null
 query       | text                     |
 bytes       | integer                  | not null
 useragent   | text                     |
Indexes: zp_log_timestamp,
         zp_log_vhost

zp1139=> \d zp_log_vhost
Index "zp_log_vhost"
   Column    | Type
-------------+------
 virtualhost | text
btree

zp1139=> \d zp_log_timestamp
       Index "zp_log_timestamp"
  Column   |           Type
-----------+--------------------------
 timestamp | timestamp with time zone
btree

zp1139=> drop index zp_log_vhost;
DROP
zp1139=> create index zp_log_vhost on zp_log (virtualhost);
CREATE
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'bloki.com';
NOTICE:  QUERY PLAN:

Aggregate (cost=5006.50..5006.50 rows=1 width=0) (actual time=1791.87..1791.87
rows=1 loops=1)
  -> Index Scan using zp_log_vhost on zp_log (cost=0.00..5003.24 rows=1302
width=0) (actual time=97.89..1786.38 rows=592 loops=1)
Total runtime: 1792.42 msec

EXPLAIN
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'www.bloki.com';
NOTICE:  QUERY PLAN:

Aggregate (cost=55289.81..55289.81 rows=1 width=0) (actual
time=35164.32..35164.33 rows=1 loops=1)
  -> Seq Scan on zp_log (cost=0.00..54541.19 rows=299449 width=0) (actual
time=23.90..33726.02 rows=303007 loops=1)
Total runtime: 35164.48 msec

EXPLAIN
zp1139=> set ENABLE_SEQSCAN to off;
SET VARIABLE
zp1139=> explain analyze select count(*) from zp_log where virtualhost =
'www.bloki.com';
NOTICE:  QUERY PLAN:

Aggregate (cost=1140051.72..1140051.72 rows=1 width=0) (actual
time=97158.42..97158.42 rows=1 loops=1)
  -> Index Scan using zp_log_vhost on zp_log (cost=0.00..1139303.10
rows=299449 width=0) (actual time=83.77..96223.63 rows=303007 loops=1)
Total runtime: 97158.93 msec

EXPLAIN
zp1139=> set ENABLE_SEQSCAN to on;
SET VARIABLE

-- here we try using both indices and see if that makes a difference

zp1139=> explain analyze select count(*) from zp_log where timestamp > (now()
- interval '2 days') and virtualhost ='aditya.bloki.org';
NOTICE:  QUERY PLAN:

Aggregate (cost=3795.19..3795.19 rows=1 width=0) (actual time=4185.29..4185.29
rows=1 loops=1)
  -> Index Scan using zp_log_vhost on zp_log (cost=0.00..3794.37 rows=327
width=0) (actual time=79.67..4180.29 rows=1564 loops=1)
Total runtime: 4185.87 msec

EXPLAIN
zp1139=> explain analyze select count(*) from zp_log where timestamp > (now()
- interval '2 days') and virtualhost = 'www.bloki.com';
NOTICE:  QUERY PLAN:

Aggregate (cost=62616.04..62616.04 rows=1 width=0) (actual
time=28471.34..28471.34 rows=1 loops=1)
  -> Seq Scan on zp_log (cost=0.00..62366.50 rows=99816 width=0) (actual
time=0.18..28439.29 rows=11417 loops=1)
Total runtime: 28471.81 msec

EXPLAIN

In response to

Responses

sfpug by date

Next:From: Josh BerkusDate: 2003-08-01 17:33:42
Subject: Re: optimizing selects on time-series data in Pg
Previous:From: Josh BerkusDate: 2003-08-01 16:58:24
Subject: Re: optimizing selects on times-eries data in Pg

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group