Quick Links

Re: Group by more efficient than distinct?

From:	Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To:	PFC <lists(at)peufeu(dot)com>
Cc:	Francisco Reyes <lists(at)stringsutils(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Pgsql performance <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Group by more efficient than distinct?
Date:	2008-04-21 23:50:22
Message-ID:	480D283E.1060602@mark.mielke.cc
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

PFC wrote:
> Actually, the memory used by the hash depends on the number of
> distinct values, not the number of rows which are processed...
> Consider :
>
> SELECT a GROUP BY a
> SELECT a,count(*) GROUP BY a
>
> In both cases the hash only holds discinct values. So if you have
> 1 million rows to process but only 10 distinct values of "a", the hash
> will only contain those 10 values (and the counts), so it will be very
> small and fast, it will absorb a huge seq scan without problem. If
> however, you have (say) 100 million distinct values for a, using a
> hash would be a bad idea. As usual, divide the size of your RAM by the
> number of concurrent connections or something.
> Note that "a" could be a column, several columns, anything, the
> size of the hash will be proportional to the number of distinct
> values, ie. the number of rows returned by the query, not the number
> of rows processed (read) by the query. Same with hash joins etc,
> that's why when you join a very small table to a large one Postgres
> likes to use seq scan + hash join on the small table.

This surprises me - hash values are lossy, so it must still need to
confirm against the real list of values, which at a minimum should
require references to the rows to check against?

Is PostgreSQL doing something beyond my imagination? :-)

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>

In response to

Re: Group by more efficient than distinct? at 2008-04-21 23:34:40 from PFC

Responses

Re: Group by more efficient than distinct? at 2008-04-22 01:39:15 from Mark Mielke
Re: Group by more efficient than distinct? at 2008-04-22 10:34:23 from Matthew Wakeling

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Mark Mielke	2008-04-22 01:39:15	Re: Group by more efficient than distinct?
Previous Message	PFC	2008-04-21 23:34:40	Re: Group by more efficient than distinct?