Re: Group by more efficient than distinct?

From: Luke Lonergan <llonergan(at)greenplum(dot)com>
To: Francisco Reyes <lists(at)stringsutils(dot)com>, Pgsql performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Group by more efficient than distinct?
Date: 2008-04-21 05:35:58
Message-ID: C43175CE.5B957%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Francisco,

Generally, PG sorting is much slower than hash aggregation for performing
the distinct operation. There may be small sizes where this isn¹t true, but
for large amounts of data (in-memory or not), hash agg (used most often, but
not always by GROUP BY) is faster.

We¹ve implemented a special optimization to PG sorting that does the
distinct processing within the sort, instead of afterward, but it¹s limited
to some small-ish number (10,000) of distinct values due to it¹s use of a
memory and processing intensive heap.

So, you¹re better off using GROUP BY and making sure that the planner is
using hash agg to do the work.

- Luke

On 4/17/08 8:46 PM, "Francisco Reyes" <lists(at)stringsutils(dot)com> wrote:

> I am trying to get a distinct set of rows from 2 tables.
> After looking at someone else's query I noticed they were doing a group by
> to obtain the unique list.
>
> After comparing on multiple machines with several tables, it seems using
> group by to obtain a distinct list is substantially faster than using
> select distinct.
>
> Is there any dissadvantage of using "group by" to obtain a unique list?
>
> On a small dataset the difference was about 20% percent.
>
> Group by
> HashAggregate (cost=369.61..381.12 rows=1151 width=8) (actual
> time=76.641..85.167 rows=2890 loops=1)
>
> Distinct
> Unique (cost=1088.23..1174.53 rows=1151 width=8) (actual
> time=90.516..140.123 rows=2890 loops=1)
>
> Although I don't have the numbers here with me, a simmilar result was
> obtaining against a query that would return 100,000 rows. 20% and more
> speed differnce between "group by" over "select distinct".
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Adrian Moisey 2008-04-21 09:50:31 connections slowing everything down?
Previous Message Tom Lane 2008-04-21 01:30:50 Re: corrupted shared memory message