Performance of query

From: Edmund Dengler <edmundd(at)eSentire(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Performance of query
Date: 2003-06-09 04:23:38
Message-ID: Pine.BSO.4.44.0306082339020.8114-100000@cyclops4.esentire.com
Lists: pgsql-general

Ok, now I am really confused. I have a query that runs really slowly, yet
when I code it up as a function it runs very fast! The optimizer does not
seem to choose a good plan for the query.

Background: This is a database based on Snort. There is a base table
called <event> (with <sid>=sensor ID, <cid>=event ID, <timestamp>=when the
event was collected, <signature>=which event was recognized), plus a number
of dependent tables for various bits and pieces of network packet data
(<iphdr>=IP header info, <tcphdr>=TCP header info, <data>=packet payload)
and <signature> (which describes each event type).

Notes: Running 7.3.3
Database is around 20GB on disk.

=> select version();
version
---------------------------------------------------------------------
PostgreSQL 7.3.3 on i386-unknown-openbsd3.0, compiled by GCC 2.95.3
(1 row)

=> \d event
Table "public.event"
Column | Type | Modifiers
-----------+--------------------------+-----------
sid | integer | not null
cid | bigint | not null
signature | integer | not null
timestamp | timestamp with time zone | not null
Indexes: event_pkey primary key btree (sid, cid),
event_pull_idx unique btree ("timestamp", sid, cid),
signature_idx btree (signature),
timestamp_idx btree ("timestamp")
Triggers: event_summary

=> \d signature
Table "public.signature"
Column | Type | Modifiers
--------------+---------+----------------------------------------------------------
sig_id | integer | not null default nextval('"signature_sig_id_seq"'::text)
<<...more columns...>>
Indexes: signature_pkey primary key btree (sig_id),
sig_class_id btree (sig_class_id),
sig_name_idx btree (sig_name)

The rest of the tables use (sid, cid) as their primary key. For example:

=> \d iphdr
Table "public.iphdr"
Column | Type | Modifiers
----------+----------+-----------
sid | integer | not null
cid | bigint | not null
<<...more columns...>>
Indexes: iphdr_pkey primary key btree (sid, cid),
ip_dst_idx btree (ip_dst),
ip_src_idx btree (ip_src)
Triggers: iphdr_summary

If I run the query:

explain analyze
select
event.sid, event.cid, event.timestamp, signature.sig_name, iphdr.ip_src, iphdr.ip_dst,
icmphdr.icmp_type, icmphdr.icmp_code, icmphdr.icmp_csum, icmphdr.icmp_id, icmphdr.icmp_seq,
udphdr.udp_sport, udphdr.udp_dport, udphdr.udp_len, udphdr.udp_csum,
tcphdr.tcp_sport, tcphdr.tcp_dport, tcphdr.tcp_seq, tcphdr.tcp_ack, tcphdr.tcp_off,
tcphdr.tcp_res, tcphdr.tcp_flags, tcphdr.tcp_win, tcphdr.tcp_csum, tcphdr.tcp_urp,
sensor.hostname, sensor.interface, data.data_payload
from (
select * from event
where timestamp > (select now() - '2 hours'::interval)
and exists (select 1 from hack_pull_sid where sid = event.sid)
) as event
left join signature on signature.sig_id = event.signature
left join iphdr on iphdr.sid = event.sid and iphdr.cid = event.cid
left join icmphdr on icmphdr.sid = event.sid and icmphdr.cid = event.cid
left join udphdr on udphdr.sid = event.sid and udphdr.cid = event.cid
left join tcphdr on tcphdr.sid = event.sid and tcphdr.cid = event.cid
left join sensor on sensor.sid = event.sid
left join data on data.sid = event.sid and data.cid = event.cid
order by
timestamp desc, event.sid desc, event.cid desc
;

(note: <hack_pull_sid> is a table of the SIDs I am interested in, used so
that I avoid the performance issues with IN)
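
For reference, the IN form I am avoiding would look something like this
(just a sketch; the EXISTS form above is what I actually run):

select * from event
where timestamp > (select now() - '2 hours'::interval)
  and sid in (select sid from hack_pull_sid);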

I get the following output:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1712181.78..1712776.52 rows=237893 width=853) (actual time=163484.81..163484.97 rows=129 loops=1)
Sort Key: public.event."timestamp", public.event.sid, public.event.cid
InitPlan
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.04..0.04 rows=1 loops=1)
-> Merge Join (cost=1471148.73..1551631.57 rows=237893 width=853) (actual time=152693.86..163484.26 rows=129 loops=1)
Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
-> Merge Join (cost=728456.82..802284.20 rows=237893 width=258) (actual time=33652.33..39050.11 rows=129 loops=1)
Merge Cond: ("outer".sid = "inner".sid)
-> Merge Join (cost=728454.78..798713.57 rows=237893 width=226) (actual time=33635.79..39031.06 rows=129 loops=1)
Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
-> Merge Join (cost=728454.78..736750.59 rows=237893 width=172) (actual time=22767.67..24071.84 rows=129 loops=1)
Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
-> Merge Join (cost=728454.78..731219.48 rows=237893 width=144) (actual time=22164.44..22681.23 rows=129 loops=1)
Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
-> Sort (cost=701134.84..701729.57 rows=237893 width=116) (actual time=15343.87..15344.05 rows=129 loops=1)
Sort Key: public.event.sid, public.event.cid
-> Merge Join (cost=590896.58..671684.37 rows=237893 width=116) (actual time=10857.65..15343.57 rows=129 loops=1)
Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
-> Index Scan using iphdr_pkey on iphdr (cost=0.00..73422.89 rows=1432042 width=28) (actual time=23.19..13810.58 rows=687918 loops=1)
-> Sort (cost=590896.58..591491.31 rows=237893 width=88) (actual time=101.79..101.95 rows=129 loops=1)
Sort Key: public.event.sid, public.event.cid
-> Hash Join (cost=185.07..563108.55 rows=237893 width=88) (actual time=97.42..101.39 rows=129 loops=1)
Hash Cond: ("outer".signature = "inner".sig_id)
-> Index Scan using timestamp_idx on event (cost=0.00..558165.62 rows=237893 width=24) (actual time=0.18..3.05 rows=129 loops=1)
Index Cond: ("timestamp" > $0)
Filter: (subplan)
SubPlan
-> Seq Scan on hack_pull_sid (cost=0.00..1.15 rows=1 width=0) (actual time=0.01..0.01 rows=1 loops=171)
Filter: (sid = $1)
-> Hash (cost=167.26..167.26 rows=7126 width=64) (actual time=97.12..97.12 rows=0 loops=1)
-> Seq Scan on signature (cost=0.00..167.26 rows=7126 width=64) (actual time=0.01..80.86 rows=7126 loops=1)
-> Sort (cost=27319.94..27897.87 rows=231171 width=28) (actual time=6606.52..7073.38 rows=110717 loops=1)
Sort Key: icmphdr.sid, icmphdr.cid
-> Seq Scan on icmphdr (cost=0.00..3784.71 rows=231171 width=28) (actual time=19.81..2082.01 rows=231303 loops=1)
-> Index Scan using udphdr_pkey on udphdr (cost=0.00..4411.44 rows=115228 width=28) (actual time=60.23..1269.61 rows=54669 loops=1)
-> Index Scan using tcphdr_pkey on tcphdr (cost=0.00..56586.06 rows=1139141 width=54) (actual time=88.72..13878.52 rows=484419 loops=1)
-> Sort (cost=2.04..2.11 rows=30 width=32) (actual time=16.46..16.64 rows=136 loops=1)
Sort Key: sensor.sid
-> Seq Scan on sensor (cost=0.00..1.30 rows=30 width=32) (actual time=16.26..16.34 rows=30 loops=1)
-> Sort (cost=742691.92..745059.80 rows=947154 width=595) (actual time=108643.60..123322.67 rows=417145 loops=1)
Sort Key: data.sid, data.cid
-> Seq Scan on data (cost=0.00..51821.54 rows=947154 width=595) (actual time=9.67..54494.61 rows=947324 loops=1)
Total runtime: 164147.78 msec
(43 rows)

For some runs, I was getting over 800,000 msec!!

Now, this seems very strange for only 129 rows. The inner select returns
exactly the 129 rows I am interested in, and then we only need to do a few
left joins to get the corresponding rows (which may or may not exist, which
is why they are LEFT JOINs). Note that (<sid>,<cid>) is the primary key for
most of the joined tables, so each LEFT JOIN should be able to recognize
that there will only be one or zero matching rows. In addition,
(<sid>,<cid>) is the primary key for the <event> table. A similar argument
applies to <signature>, since <sig_id> is its primary key.
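
To illustrate what I expect per event row: a single probe on one of these
primary keys should be a cheap index lookup. A sketch (the sid/cid values
below are made up, not real data):

explain analyze select * from iphdr where sid = 1 and cid = 42;
-- expecting an Index Scan using iphdr_pkey returning zero or one row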

Notes:
- The table <event> has 1,427,411 rows
- <event>.<timestamp> ranges from 2002-01-19 12:28:29-05 to
2003-06-08 23:40:01-04
- The last 2 hours is a very small range assuming an even distribution
- All tables are using 100 bins for statistics
- A VACUUM ANALYZE was performed on the entire database before this test (the statistics and analyze commands are sketched just below)
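
Roughly, the statistics and analyze setup was done like this, repeated for
each table and column of interest (a sketch from memory, not a paste of the
actual session):

alter table event alter column "timestamp" set statistics 100;
vacuum analyze event;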

As can be seen, the estimator seems to think that 237,893 rows will be
returned from the inner SELECT out of 1,427,411 rows. Can anyone explain
how this number could be calculated? Even assuming an even distribution
over time, a rough estimate would be 2 hours / (24 hours/day * 400 days),
which is (very roughly) 0.02% of the table.
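
To see what the planner thinks it knows about the timestamp column, I can
query pg_stats (a sketch; I have not pasted the actual output here):

select attname, n_distinct, histogram_bounds
from pg_stats
where tablename = 'event' and attname = 'timestamp';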

So, I wrote a function. (Please, no remarks on style, it was a quick hack
to compare performance. I am sure it could be made better and more
optimized). It simply performs the inner SELECT, and then for each row,
does the outer select for the match.

CREATE TYPE hack_pull_type AS (
sid int, cid bigint, timestamp timestamp with time zone,
sig_name text,
ip_src bigint, ip_dst bigint,
icmp_type smallint, icmp_code smallint, icmp_csum int, icmp_id int, icmp_seq int,
udp_sport int, udp_dport int, udp_len int, udp_csum int,
tcp_sport int, tcp_dport int, tcp_seq bigint, tcp_ack bigint, tcp_off smallint,
tcp_res smallint, tcp_flags smallint, tcp_win int, tcp_csum int, tcp_urp int,
hostname text, interface text, data_payload text
);

CREATE OR REPLACE FUNCTION hack_pull_func() RETURNS SETOF hack_pull_type
AS '
DECLARE
event_rec event%ROWTYPE;
-- pull_rec hack_pull_type;
pull_rec RECORD;
BEGIN
FOR event_rec IN
SELECT *
FROM event
WHERE timestamp > (SELECT now() - ''2 hours''::interval)
AND EXISTS (SELECT 1 FROM hack_pull_sid WHERE sid = event.sid)
ORDER BY timestamp desc, sid desc, cid desc
LOOP
SELECT
event.sid, event.cid, event.timestamp,
sig_name, ip_src, ip_dst,
icmp_type, icmp_code, icmp_csum, icmp_id, icmp_seq,
udp_sport, udp_dport, udp_len, udp_csum,
tcp_sport, tcp_dport, tcp_seq, tcp_ack, tcp_off, tcp_res,
tcp_flags, tcp_win, tcp_csum, tcp_urp,
sensor.hostname, sensor.interface, data.data_payload
INTO pull_rec
FROM event
left join signature on signature.sig_id = event.signature
left join iphdr on iphdr.sid = event.sid and iphdr.cid = event.cid
left join icmphdr on icmphdr.sid = event.sid and icmphdr.cid = event.cid
left join udphdr on udphdr.sid = event.sid and udphdr.cid = event.cid
left join tcphdr on tcphdr.sid = event.sid and tcphdr.cid = event.cid
left join sensor on sensor.sid = event.sid
left join data on data.sid = event.sid and data.cid = event.cid
WHERE
event.cid = event_rec.cid and event.sid = event_rec.sid
LIMIT 1;
RETURN NEXT pull_rec;
END LOOP;

RETURN;
END;
' LANGUAGE 'plpgsql';

=> explain analyze select * from hack_pull_func();
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Function Scan on hack_pull_func (cost=0.00..12.50 rows=1000 width=238) (actual time=5259.46..5259.62 rows=23 loops=1)
Total runtime: 5259.70 msec
(2 rows)

=> explain analyze select * from hack_pull_func();
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Function Scan on hack_pull_func (cost=0.00..12.50 rows=1000 width=238) (actual time=38.02..38.18 rows=23 loops=1)
Total runtime: 38.26 msec
(2 rows)

(Sorry, the data changed as time moved forward while I typed this email.
However, the numbers are representative of the performance difference I
observed during testing.)

So, what is happening? How come my function (ugly as it may be) is orders
of magnitude faster than what should be a clean SQL query? Why is the
estimator so far off from reality?

Is there a way of writing the query to get the good performance? Note that
there are a number of other queries I run which exhibit similar
performance/estimation issues, and I wonder if I will need to rewrite all
of them as custom functions (thanks for the SETOF feature, as it makes
writing this stuff possible now).
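
One variant I may still try (a sketch only, not something I have
benchmarked) is to compute the cutoff first and inline it as a plain
literal, on the theory that the planner can estimate a constant comparison
against the timestamp histogram better than a comparison against a
subselect. The literal below is just an example value; the LEFT JOINs would
stay the same:

select * from event
where timestamp > '2003-06-08 21:40:00-04'::timestamp with time zone
  and exists (select 1 from hack_pull_sid where sid = event.sid);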

Thanks!
Ed
