Re: [PATCHES] updated hash functions for postgresql v1

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kenneth Marshall <ktm(at)rice(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] updated hash functions for postgresql v1
Date: 2008-11-04 20:32:47
Message-ID: Pine.LNX.4.64.0811042330110.15810@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Just interested if you repeat your tests not with cracklib-dict,
but using 8-bit words. From our experience we found many hash functions
are optimized for 7-bit words and produce too many collisions
for 8-bit words. That's why we use crc32.

Oleg
On Tue, 4 Nov 2008, Kenneth Marshall wrote:

> Sorry about the delay for this update to the new hash
> index implementation. I was trying to get the WAL logging
> in place and forgot to post the actual patch. The WAL
> for hash indexes will need to wait for 8.5, but I did
> want to add back in the piece of the Bob Jenkins 2006
> hash function that was stripped out of the initial
> patch on application due to concerns about the randomness
> of the resulting hash values. Here is a re-post of my
> initial findings comparing the old/new Jenkins hash
> from lookup2 and lookup3. I have added a third column
> containing the results for the hash_any() resulting
> from the attached patch as well as simple timing test
> for a DB reindex both before and after patching.
>
> Also attached is a simple documentation patch updating
> the note attached to the hash index description.
>
> Regards,
> Ken
> ----------------------------------------------------
> Hi,
>
> I have finally had a chance to do some investigation on
> the performance of the old hash mix() function versus
> the updated mix()/final() in the new hash function. Here
> is a table of my current results for both the old and the
> new hash function. In this case cracklib refers to the
> cracklib-dict containing 1648379 unique words massaged
> in various ways to generate input strings for the hash
> functions. The result is the number of collisions in the
> hash values generated.
>
> hash input old new newv2
> ---------- --- --- -----
> cracklib 338 316 338
> cracklib x 2 (i.e. clibclib) 305 319 300
> cracklib x 3 (clibclibclib) 323 329 315
> cracklib x 10 302 310 329
> cracklib x 100 350 335 298
> cracklib x 1000 314 309 315
> cracklib x 100 truncated to char(100) 311 327 320
>
> uint32 from 1-1648379 309 319 347
> (uint32 1-1948379)*256 309 314 304
> (uint32 1-1948379)*16 310 314 324
> "a"uint32 (i.e. a00001,a0002...) 320 321 312
>
> uint32uint32 (i.e. uint64) 321 287 309
>
> The different result columns are old = Jenkins 1996 hash
> function(lookup2.c), new = Jenkins 2006 hash function
> (lookup3.c), and newv2 = adaptation of current hash_any()
> to incorporate the separate mix()/final() functions. As
> you can see from the results, spliting the mix() and final()
> apart does not result in any perceptible loss of randomness
> in the hash assignment. I also ran a crude timing for a
> reindex of the following database:
>
> CREATE TABLE dict (word text);
> CREATE INDEX wordhash ON dict USING hash (word);
> INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
> INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
> ... (21 times)
>
> REINDEX TABLE
> ...
>
> The average time to reindex the table using our current
> hash_any() without the separate mix()/final() was 1696ms
> and 1482ms with the separate mix()/final() stages giving
> almost 13% better performance for this stupid metric.
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ron Mayer 2008-11-04 20:34:35 Re: Patch for SQL-Standard Interval output and decoupling DateStyle from IntervalStyle
Previous Message Kenneth Marshall 2008-11-04 20:26:55 Re: [PATCHES] updated hash functions for postgresql v1