From: | "Michael Artz" <mlartz(at)gmail(dot)com> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | Subselect query enhancement |
Date: | 2007-02-01 16:42:03 |
Message-ID: | e9c163070702010842ja639378v75e8c0eb757d9b72@mail.gmail.com |
Lists: | pgsql-performance |
I need help determining the best all-around query for the
following situation. I have a primary table that holds IP information
and two other tables that hold event data for a specific IP, with
a one-to-many mapping from the primary table to each of them, i.e.:
CREATE TABLE ip_info (
ip IP4,
--other data
);
CREATE TABLE network_events (
ip IP4 NOT NULL REFERENCES ip_info(ip),
name VARCHAR,
port INTEGER,
--other data
);
CREATE TABLE host_events (
ip IP4 NOT NULL REFERENCES ip_info(ip),
name VARCHAR,
port INTEGER,
--other data
);
There is quite a bit of commonality between the network_events and
host_events schemas, but they do not currently share an ancestor.
ip_info has about 13 million rows, the network_events table has about
30 million rows, and the host_events table has about 7 million rows.
There are indexes on all the columns.
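For the queries below, the indexes that matter most are those on name and ip in the two event tables and on ip in ip_info. A sketch of what they might look like; the index names here are hypothetical and not taken from the actual schema:

```sql
-- Hypothetical index names; the real schema may name or combine these differently.
CREATE INDEX ip_info_ip_idx ON ip_info (ip);
CREATE INDEX network_events_name_idx ON network_events (name);
CREATE INDEX network_events_ip_idx ON network_events (ip);
CREATE INDEX host_events_name_idx ON host_events (name);
CREATE INDEX host_events_ip_idx ON host_events (ip);
```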
The query that I would like to execute is to select all the rows of
ip_info that have either network or host events that meet some
criteria, i.e. name='blah'. I have 3 different possibilities that I
have thought of to execute this.
First, 2 'ip IN (SELECT ...)' statements joined by an OR:
SELECT * FROM ip_info
WHERE ip IN (SELECT ip FROM network_events WHERE name='blah')
OR ip IN (SELECT ip FROM host_events WHERE name='blah');
Next, 1 'ip IN (SELECT ... UNION SELECT ...)' statement:
SELECT * FROM ip_info
WHERE ip IN (SELECT ip FROM network_events WHERE name='blah'
UNION
SELECT ip FROM host_events WHERE name='blah');
Or, finally, the UNION statement with DISTINCTs:
SELECT * FROM ip_info
WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah'
UNION
SELECT DISTINCT ip FROM host_events WHERE name='blah');
From what I have read, the UNION statement does an implicit DISTINCT,
but I thought that doing it on each of the tables would result in
slightly faster execution. Can you think of any other ways to
implement the previous query?
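Two further formulations that should return the same rows, given as untested sketches rather than measured alternatives: correlated EXISTS subqueries, and a UNION ALL inside the IN. UNION ALL skips the duplicate-removal step, and since IN only tests membership, duplicates in the subquery do not change the result. The table aliases (i, n, h) are illustrative only.

```sql
-- Variant A: correlated EXISTS on each event table.
SELECT i.*
FROM ip_info i
WHERE EXISTS (SELECT 1 FROM network_events n
              WHERE n.ip = i.ip AND n.name = 'blah')
   OR EXISTS (SELECT 1 FROM host_events h
              WHERE h.ip = i.ip AND h.name = 'blah');

-- Variant B: UNION ALL instead of UNION inside the IN list.
SELECT * FROM ip_info
WHERE ip IN (SELECT ip FROM network_events WHERE name = 'blah'
             UNION ALL
             SELECT ip FROM host_events WHERE name = 'blah');
```

Whether either one beats the three queries above on this data would have to be checked with EXPLAIN ANALYZE.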
I have explained/analyzed all the queries but, unfortunately, they are
on an isolated computer. The gist is that, for relatively
low-incidence values of name, the UNION performs better, but for
queries on a common name, the dual-subselect query performs better.
The explains look something like:
Dual-subselect:
Seq scan on ip_info
  Filter: ... AND ((hashed_subplan) OR (hashed_subplan))
  Subplan
    -> Result
      -> Append
        -> various scans on host_events
    -> Result
      -> Append
        -> various scans on network_events
UNION SELECT DISTINCT:
Nested Loop
  -> Unique
    -> Sort
      -> Append
        -> Unique
          -> Sort
            -> Result
              -> Append
                -> various scans on host_events
        -> Unique
          -> Sort
            -> Result
              -> Append
                -> various scans on network_events
If it would help to have more information, I could retype some of
the numbers in the explain.
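For reference, plans with actual row counts and timings, like those summarized above, come from prefixing a query with EXPLAIN ANALYZE. A minimal sketch using the dual-subselect query:

```sql
-- EXPLAIN ANALYZE executes the query and reports actual timings per plan node.
EXPLAIN ANALYZE
SELECT * FROM ip_info
WHERE ip IN (SELECT ip FROM network_events WHERE name = 'blah')
   OR ip IN (SELECT ip FROM host_events WHERE name = 'blah');
```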
Any ideas?
Thanks,
-Mike