Re: Multiple Uniques

From: Markus Schaber <schabios(at)logi-track(dot)com>
To: PostgreSQL Performance List <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Multiple Uniques
Date: 2004-09-06 15:56:18
Message-ID: 20040906175618.5497a124@kingfisher.intern.logi-track.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi, Greg,

On 02 Sep 2004 15:33:38 -0400
Greg Stark <gsstark(at)mit(dot)edu> wrote:

> Markus Schaber <schabios(at)logi-track(dot)com> writes:
>
> > logigis=# explain select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb;
> > QUERY PLAN
> > ---------------------------------------------------------------------------------------------------------
> > Aggregate (cost=16220893.16..16220893.16 rows=1 width=8)
> > -> Subquery Scan blubb (cost=15254815.03..16082881.99 rows=55204464 width=8)
> > -> Unique (cost=15254815.03..15530837.35 rows=55204464 width=8)
> > -> Sort (cost=15254815.03..15392826.19 rows=55204464 width=8)
> > Sort Key: id
> > -> Append (cost=0.00..6810225.28 rows=55204464 width=8)
> > -> Subquery Scan "*SELECT* 1" (cost=0.00..3405112.64 rows=27602232 width=8)
> > -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8)
> > -> Subquery Scan "*SELECT* 2" (cost=0.00..3405112.64 rows=27602232 width=8)
> > -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8)
>
> You can actually go one step further and do:
>
> select count(distinct id) from (select ... union all select ...) as blubb;

Hmm, as query plan, it gives me:

> logigis=# explain select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as id from streets) as blubb;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------
> Aggregate (cost=7500281.08..7500281.08 rows=1 width=8)
> -> Subquery Scan blubb (cost=0.00..7362269.92 rows=55204464 width=8)
> -> Append (cost=0.00..6810225.28 rows=55204464 width=8)
> -> Subquery Scan "*SELECT* 1" (cost=0.00..3405112.64 rows=27602232 width=8)
> -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8)
> -> Subquery Scan "*SELECT* 2" (cost=0.00..3405112.64 rows=27602232 width=8)
> -> Seq Scan on streets (cost=0.00..3129090.32 rows=27602232 width=8)
> (7 rows)

This leaves off the sort and unique completely, and leaves all work to
the count aggrecation. Maybe this is implemented by hashing and thus
somehow more efficient...

> I'm not sure why this is any faster since it still has to do all the same
> work, but it's a different code path and it seems to be about 20% faster for
> me.

When comparing the actual timings, I get:

> logigis=# \timing
> Timing is on.
> logigis=# select count(distinct id) from (select ref_in_id as id from streets union all select nref_in_id as id from streets) as blubb;
> count
> ----------
> 20225596
> (1 row)
>
> Time: 1339098.226 ms
> logigis=# select count(id) from (select ref_in_id as id from streets union select nref_in_id as id from streets) as blubb;
> count
> ----------
> 20225596
> (1 row)
>
> Time: 1558920.784 ms
> logigis=#

So you're right, its faster this way. There seems to be some room to
play with optimizer enhancements.

But simply removing all distincts from subselects would not be the best
way to go, as reducing intermediate datasets can also improve the
performance.

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios(at)logi-track(dot)com | www.logi-track.com

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2004-09-06 16:40:41 Re: The usual sequential scan, but with LIMIT !
Previous Message Dennis Bjorklund 2004-09-06 12:45:30 Re: The usual sequential scan, but with LIMIT !