Re: Subselect query enhancement

From: Richard Huxton <dev(at)archonet(dot)com>
To: Michael Artz <mlartz(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Subselect query enhancement
Date: 2007-02-02 09:51:42
Message-ID: 45C309AE.1030506@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Michael Artz wrote:
> Here are some numbers for 3 different queries using a very selective
> query (port = 2222). I'm thinking that, since the row estimates are
> different from the actuals (2 vs 2000), that this particular port
> didn't make it into the statistics ... is that true? Does this
> matter? If this isn't enough data, I can type up some more.

SELECT * FROM pg_stats WHERE tablename='foo';
This will show you frequency-stats on each column (as generated by
analyse). You're interested in n_distinct, most_common_vals,
most_common_freqs.

> One thing that I forgot to mention is that the network_events and
> host_events tables are partitioned by first octet of the IP, so when I
> say "various scans of ..." that means that there is a scan of each of
> the partitions, the type determined by the planner and the statistics,
> I assume.

So you've got xxx_events tables partitioned by ip, but ip_info is one
table? Do you do a lot of scans across the bottom 3 bytes of the IP? If
not, I'm not clear what we're gaining from the partitioning.

> ** Dual subselects:
> SELECT * FROM ip_info
> WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222)
> OR ip IN (SELECT ip FROM host_events WHERE port = 2222);
>
> Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual
> time=34..8238 rows=234 loops=1)
> Filter: ... AND ((hashed_subplan) OR (hashed_subplan))

Well, the estimate here is rubbish - 9.3 million rows whereas we
actually get 234. Now we know you're likely to get a lot of overlap, and
the planner might not realise that. Still - that looks very bad. Of
course, because it's expecting so many rows a seq-scan of ip_info looks
like a good choice to it.

> ** Single subselect:
>
> SELECT * FROM ip_info
> WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222
> UNION
> SELECT DISTINCT ip FROM host_events WHERE port = 2222);
>
> Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106
> rows=2349 loops=1)

This is clearly a lot better, Not sure whether the DISTINCT in each
subquery works or not.

> ** The join:
>
> SELECT distinct ip_info.*
> FROM ip_info RIGHT JOIN network_events USING (ip)
> RIGHT JOIN host_events USING (ip)
> WHERE (network_events.port=2222 OR host_events.port=2222)
>
> Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52
> loops=1)
> -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61
> rows=63 loops=1)

OK, so what do the plans look like for port=80 or something larger like
that?

Then try adding an index to the various host/network_events tables
CREATE INDEX ... ON ... (ip) WHERE port=80;

--
Richard Huxton
Archonet Ltd

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Kirk Wythers 2007-02-02 13:52:48 trouble with a join on OS X
Previous Message Bill Howe 2007-02-02 01:44:21 index scan through a subquery