Quick Links

Re: Subselect query enhancement

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Michael Artz <mlartz(at)gmail(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Subselect query enhancement
Date:	2007-02-02 09:51:42
Message-ID:	45C309AE.1030506@archonet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Michael Artz wrote:
> Here are some numbers for 3 different queries using a very selective
> query (port = 2222). I'm thinking that, since the row estimates are
> different from the actuals (2 vs 2000), that this particular port
> didn't make it into the statistics ... is that true? Does this
> matter? If this isn't enough data, I can type up some more.

SELECT * FROM pg_stats WHERE tablename='foo';
This will show you frequency-stats on each column (as generated by
analyse). You're interested in n_distinct, most_common_vals,
most_common_freqs.

> One thing that I forgot to mention is that the network_events and
> host_events tables are partitioned by first octet of the IP, so when I
> say "various scans of ..." that means that there is a scan of each of
> the partitions, the type determined by the planner and the statistics,
> I assume.

So you've got xxx_events tables partitioned by ip, but ip_info is one
table? Do you do a lot of scans across the bottom 3 bytes of the IP? If
not, I'm not clear what we're gaining from the partitioning.

> ** Dual subselects:
> SELECT * FROM ip_info
> WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222)
> OR ip IN (SELECT ip FROM host_events WHERE port = 2222);
>
> Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual
> time=34..8238 rows=234 loops=1)
> Filter: ... AND ((hashed_subplan) OR (hashed_subplan))

Well, the estimate here is rubbish - 9.3 million rows whereas we
actually get 234. Now we know you're likely to get a lot of overlap, and
the planner might not realise that. Still - that looks very bad. Of
course, because it's expecting so many rows a seq-scan of ip_info looks
like a good choice to it.

> ** Single subselect:
>
> SELECT * FROM ip_info
> WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222
> UNION
> SELECT DISTINCT ip FROM host_events WHERE port = 2222);
>
> Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106
> rows=2349 loops=1)

This is clearly a lot better, Not sure whether the DISTINCT in each
subquery works or not.

> ** The join:
>
> SELECT distinct ip_info.*
> FROM ip_info RIGHT JOIN network_events USING (ip)
> RIGHT JOIN host_events USING (ip)
> WHERE (network_events.port=2222 OR host_events.port=2222)
>
> Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52
> loops=1)
> -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61
> rows=63 loops=1)

OK, so what do the plans look like for port=80 or something larger like
that?

Then try adding an index to the various host/network_events tables
CREATE INDEX ... ON ... (ip) WHERE port=80;

--
Richard Huxton
Archonet Ltd

In response to

Re: Subselect query enhancement at 2007-02-01 19:06:23 from Michael Artz

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Kirk Wythers	2007-02-02 13:52:48	trouble with a join on OS X
Previous Message	Bill Howe	2007-02-02 01:44:21	index scan through a subquery