Re: [PATCHES] updated hash functions for postgresql v1

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2008-11-04 21:15:44
Message-ID: 20081104211544.GQ18362@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote:
> Just interested if you repeat your tests not with cracklib-dict,
> but using 8-bit words. From our experience we found many hash functions
> are optimized for 7-bit words and produce too many collisions
> for 8-bit words. That's why we use crc32.
>
> Oleg

I think that the lines:

uint32 from 1-1648379 309 319 347
(uint32 1-1948379)*256 309 314 304
(uint32 1-1948379)*16 310 314 324
"a"uint32 (i.e. a00001,a0002...) 320 321 312

uint32uint32 (i.e. uint64) 321 287 309

can count as 8-bit words if taken a byte at a time. In fact
that is how hash_any() treats them, as a character string
and a length. One of the design goals of the original 1997
hash function in lookup2 and the 2006 update in lookup3 is
to support keys of arbitrary arrangements of bits. I can run
any additional checks that you want since the test harness
is perl with Inline::C. If you are using crc32 his article in
Dr. Dobbs shows that CRC has a "2 into 1" funnel-15 and an
"11 into 10" funnel-100 unless you are using a generalized
CRC. Also, unless you can inline your CRC the Jenkins lookup3
is 5n+20 where CRC is 9n+3.

Regards,
Ken

> On Tue, 4 Nov 2008, Kenneth Marshall wrote:
>
>> Sorry about the delay for this update to the new hash
>> index implementation. I was trying to get the WAL logging
>> in place and forgot to post the actual patch. The WAL
>> for hash indexes will need to wait for 8.5, but I did
>> want to add back in the piece of the Bob Jenkins 2006
>> hash function that was stripped out of the initial
>> patch on application due to concerns about the randomness
>> of the resulting hash values. Here is a re-post of my
>> initial findings comparing the old/new Jenkins hash
>> from lookup2 and lookup3. I have added a third column
>> containing the results for the hash_any() resulting
>> from the attached patch as well as simple timing test
>> for a DB reindex both before and after patching.
>>
>> Also attached is a simple documentation patch updating
>> the note attached to the hash index description.
>>
>> Regards,
>> Ken
>> ----------------------------------------------------
>> Hi,
>>
>> I have finally had a chance to do some investigation on
>> the performance of the old hash mix() function versus
>> the updated mix()/final() in the new hash function. Here
>> is a table of my current results for both the old and the
>> new hash function. In this case cracklib refers to the
>> cracklib-dict containing 1648379 unique words massaged
>> in various ways to generate input strings for the hash
>> functions. The result is the number of collisions in the
>> hash values generated.
>>
>> hash input old new newv2
>> ---------- --- --- -----
>> cracklib 338 316 338
>> cracklib x 2 (i.e. clibclib) 305 319 300
>> cracklib x 3 (clibclibclib) 323 329 315
>> cracklib x 10 302 310 329
>> cracklib x 100 350 335 298
>> cracklib x 1000 314 309 315
>> cracklib x 100 truncated to char(100) 311 327 320
>>
>> uint32 from 1-1648379 309 319 347
>> (uint32 1-1948379)*256 309 314 304
>> (uint32 1-1948379)*16 310 314 324
>> "a"uint32 (i.e. a00001,a0002...) 320 321 312
>>
>> uint32uint32 (i.e. uint64) 321 287 309
>>
>> The different result columns are old = Jenkins 1996 hash
>> function(lookup2.c), new = Jenkins 2006 hash function
>> (lookup3.c), and newv2 = adaptation of current hash_any()
>> to incorporate the separate mix()/final() functions. As
>> you can see from the results, spliting the mix() and final()
>> apart does not result in any perceptible loss of randomness
>> in the hash assignment. I also ran a crude timing for a
>> reindex of the following database:
>>
>> CREATE TABLE dict (word text);
>> CREATE INDEX wordhash ON dict USING hash (word);
>> INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
>> INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
>> ... (21 times)
>>
>> REINDEX TABLE
>> ...
>>
>> The average time to reindex the table using our current
>> hash_any() without the separate mix()/final() was 1696ms
>> and 1482ms with the separate mix()/final() stages giving
>> almost 13% better performance for this stupid metric.
>>
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2008-11-04 21:17:39 Re: ARRAY vars (was Enable pl/python to return records based on multiple OUT params)
Previous Message Robert Haas 2008-11-04 21:09:37 Re: BufferAccessStrategy for bulk insert