
Re: Group by more efficient than distinct?

From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: PFC <lists(at)peufeu(dot)com>
Cc: Francisco Reyes <lists(at)stringsutils(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Pgsql performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Group by more efficient than distinct?
Date: 2008-04-22 01:39:15
Message-ID:
Lists: pgsql-performance
Mark Mielke wrote:
> PFC wrote:
>>     Actually, the memory used by the hash depends on the number of 
>> distinct values, not the number of rows which are processed...
>>     Consider:
>> SELECT a,count(*) GROUP BY a
>>     In both cases the hash only holds distinct values. So if you have 
>> 1 million rows to process but only 10 distinct values of "a", the 
>> hash will only contain those 10 values (and the counts), so it will 
>> be very small and fast, it will absorb a huge seq scan without 
problem. If, however, you have (say) 100 million distinct values for 
>> a, using a hash would be a bad idea. As usual, divide the size of 
>> your RAM by the number of concurrent connections or something.
>>     Note that "a" could be a column, several columns, anything, the 
>> size of the hash will be proportional to the number of distinct 
>> values, ie. the number of rows returned by the query, not the number 
>> of rows processed (read) by the query. Same with hash joins etc, 
>> that's why when you join a very small table to a large one Postgres 
>> likes to use seq scan + hash join on the small table.
> This surprises me - hash values are lossy, so it must still need to 
> confirm against the real list of values, which at a minimum should 
> require references to the rows to check against?
> Is PostgreSQL doing something beyond my imagination? :-)
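
As a rough illustration of the hash aggregate case described above (the 
table and column names here are hypothetical, and whether the planner 
actually hashes depends on work_mem and its row estimates):

-- Suppose big_table has ~1 million rows but only ~10 distinct values of a.
-- The planner can then use a HashAggregate: its hash table holds one entry
-- per distinct value of a (plus the running count), however many rows the
-- seq scan feeds into it.
EXPLAIN ANALYZE
SELECT a, count(*)
FROM big_table
GROUP BY a;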

Hmmm... You did say distinct values, so I can see how that would work 
for distinct. What about seq scan + hash join, though? To complete the 
join, wouldn't it need to have a reference to each of the rows to join 
against? If there are 20 distinct values and 200 rows in the small table, 
wouldn't it need 200 references to be stored?
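
For reference, a sketch of the seq scan + hash join shape being discussed 
(table and column names are hypothetical):

-- small_table has ~200 rows, big_table has millions.
-- The typical plan here seq-scans small_table, builds a hash table over it
-- as the inner side of the join, then probes that hash while scanning
-- big_table.
EXPLAIN ANALYZE
SELECT b.*
FROM big_table b
JOIN small_table s ON s.id = b.small_id;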


Mark Mielke <mark(at)mielke(dot)cc>
