Re: Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: custom hash-based COUNT(DISTINCT) aggregate - unexpectedly high memory consumption
Date: 2013-10-07 19:56:33
Message-ID: 525311F1.5020408@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7.10.2013 14:50, ktm(at)rice(dot)edu wrote:
> On Mon, Oct 07, 2013 at 12:41:58AM +0200, Tomas Vondra wrote:
>>> 2. Consider using a simpler/faster hash function, like FNV[1] or Jenkins[2].
>>> For fun, try not hashing those ints at all and see how that performs (that,
>>> I think, is what you get from HashSet<int> in Java/C#).
>>
>> I've used crc32 mostly because it's easily available in the code (i.e.
>> I'm lazy), but I've done some quick research / primitive benchmarking
>> too. For example hashing 2e9 integers takes this much time:
>>
>> FNV-1 = 11.9
>> FNV-1a = 11.9
>> jenkins = 38.8
>> crc32 = 32.0
>>
>> So it's not really "slow" and the uniformity seems to be rather good.
>>
>> I'll try FNV in the future, however I don't think that's the main issue
>> right now.
>>
> Hi Tomas,
>
> If you are going to use a function that is not currently in the code,
> please consider xxhash:
>
> http://code.google.com/p/xxhash/
>
> Here are some benchmarks for some of the faster hash functions:
>
> Name Speed Q.Score Author
> xxHash 5.4 GB/s 10
> MumurHash 3a 2.7 GB/s 10 Austin Appleby
> SpookyHash 2.0 GB/s 10 Bob Jenkins
> SBox 1.4 GB/s 9 Bret Mulvey
> Lookup3 1.2 GB/s 9 Bob Jenkins
> CityHash64 1.05 GB/s 10 Pike & Alakuijala
> FNV 0.55 GB/s 5 Fowler, Noll, Vo
> CRC32 0.43 GB/s 9
> MD5-32 0.33 GB/s 10 Ronald L. Rivest
> SHA1-32 0.28 GB/s 10

Hi,

thanks for the link. I repeated the simple benchmark (code is here:
http://pastebin.com/e9BS03MA) with these results:

FNV duration = 9.821
FNVa duration = 9.753
jenkins duration = 37.658
crc32 duration = 7.127
xxhash duration = 68.814

The only difference is that this time I added -O3 (which I forgot last
time). That's probably the reason why CRC32 even faster than FNV.

Anyway, xxhash seems to be much slower than the other functions, at
least in this particular case.

I don't think the poor results necessarily contradict the table of
results you've posted, because that's on "a large block of random data"
while I'm hashing very short amounts of data (4B / 8B). So my guess is
xxhash init takes much more time than the other algorithms. Chances are
it'd be faster on large amounts of data, but that's not really useful.

Maybe I'll revisit it in the future if I'll decide to add support for
varlena types. Until then I'll stick with crc32 which is fast and has
much better Quality score than FNV.

Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-10-07 19:57:46 Re: [COMMITTERS] pgsql: Add DISCARD SEQUENCES command.
Previous Message Robert Haas 2013-10-07 19:46:13 Re: plpgsql.print_strict_params