Re: updated hash functions for postgresql v1

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org, twraney(at)comcast(dot)net, neilc(at)samurai(dot)com
Subject: Re: updated hash functions for postgresql v1
Date: 2008-08-29 21:15:37
Message-ID: 20080829211536.GA16964@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

On Sun, Oct 28, 2007 at 08:06:58PM +0000, Simon Riggs wrote:
> On Sun, 2007-10-28 at 13:05 -0500, Kenneth Marshall wrote:
> > On Sun, Oct 28, 2007 at 05:27:38PM +0000, Simon Riggs wrote:
> > > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote:
> > > > Its features include a better and faster hash function.
> > >
> > > Looks very promising. Do you have any performance test results to show
> > > it really is faster, when compiled into Postgres? Better probably needs
> > > some definition also; in what way are the hash functions better?
> > >
> > > --
> > > Simon Riggs
> > > 2ndQuadrant http://www.2ndQuadrant.com
> > >
> > The new hash function is roughly twice as fast as the old function in
> > terms of straight CPU time. It uses the same design as the current
> > hash but provides code paths for aligned and unaligned access as well
> > as separate mixing functions for different blocks in the hash run
> > instead of having one general purpose block. I think the speed will
> > not be an obvious win with smaller items, but will be very important
> > when hashing larger items (up to 32kb).
> >
> > Better in this case means that the new hash mixes more thoroughly
> > which results in less collisions and more even bucket distribution.
> > There is also a 64-bit varient which is still faster since it can
> > take advantage of the 64-bit processor instruction set.
>
> Ken, I was really looking for some tests that show both of the above
> were true. We've had some trouble proving the claims of other algorithms
> before, so I'm less inclined to take those things at face value.
>
> I'd suggest tests with Integers, BigInts, UUID, CHAR(20) and CHAR(100).
> Others may have different concerns.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com
>
Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input old new
---------- --- ---
cracklib 338 316
cracklib x 2 (i.e. clibclib) 305 319
cracklib x 3 (clibclibclib) 323 329
cracklib x 10 302 310
cracklib x 100 350 335
cracklib x 1000 314 309
cracklib x 100 truncated to char(100) 311 327

uint32 from 1-1648379 309 319
(uint32 1-1948379)*256 309 314
(uint32 1-1948379)*16 310 314
"a"uint32 (i.e. a00001,a0002...) 320 321

uint32uint32 (i.e. uint64) 321 287

In addition to these tests, the new mixing functions allow
the hash to pass the frog2.c test by Bob Jenkins. Here is
his comment from http://burtleburtle.net/bob/hash/doobs.html:

lookup3.c does a much more thorough job of mixing than any
of my previous hashes (lookup2.c, lookup.c, One-at-a-time).
All my previous hashes did a more thorough job of mixing
than Paul Hsieh's hash. Paul's hash does a good enough job
of mixing for most practical purposes.

The most evil set of keys I know of are sets of keys that are
all the same length, with all bytes zero, except with a few
bits set. This is tested by frog.c.. To be even more evil, I
had my hashes return b and c instead of just c, yielding a
64-bit hash value. Both lookup.c and lookup2.c start seeing
collisions after 2^53 frog.c keypairs. Paul Hsieh's hash sees
collisions after 2^17 keypairs, even if we take two hashes with
different seeds. lookup3.c is the only one of the batch that
passes this test. It gets its first collision somewhere beyond
2^63 keypairs, which is exactly what you'd expect from a completely
random mapping to 64-bit values.

If anyone has any other data for me to test with, please let me
know. I think this is a reasonable justification for including the
new mixing process (mix() and final()) as well as the word-at-a-time
processing in our hash function. I will be putting a small patch
together to add the new mixing process back in to the updated hash
function this weekend in time for the September commit-fest unless
there are objections. Both the old and the new hash functions meet
the strict avalanche conditions as well.
(http://home.comcast.net/~bretm/hash/3.html)

I have used an Inline::C perl driver for these tests and can post
it if others would like to use it as a testbed.

Regards,
Ken
avalance

In response to

Browse pgsql-patches by date

  From Date Subject
Next Message Karl Schnaitter 2008-08-29 22:00:04 fixing bug in combocid.c
Previous Message Alvaro Herrera 2008-08-29 17:32:07 Re: [PATCHES] pg_dump lock timeout