Re: Hash support for arrays

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, marcin mank <marcin(dot)mank(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash support for arrays
Date: 2010-11-04 12:35:39
Message-ID: 20101104123539.GE27429@aart.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 04, 2010 at 10:00:40AM +0000, Dean Rasheed wrote:
> On 3 November 2010 09:24, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com> wrote:
> > 2010/11/2 Kenneth Marshall <ktm(at)rice(dot)edu>:
> >
> >> Given that our hash implimentation mixes the input data well (It does.
> >> I tested it.) then a simple rotate-and-xor method is all that should
> >> be needed to maintain all of the needed information. The original
> >> hash function has done the heavy lifting in this case.
> >
> > Even with the perfect hash function for the elements, certain
> > combinations of elements could still lead to massive collisions. E.g.,
> > if repeated values are typical in the input data we are talking about,
> > then the rotate-and-xor method would still lead to collisions between
> > any array of the same values of certain lengths, regardless of the
> > value. In Tom's implementation, as he mentioned before, those
> > problematical lengths would be multiples of 32 (e.g., an array of 32
> > 1s would collide with an array of 32 2s would collide with an array of
> > 32 3s, etc).
> >
>
> Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any
> array of 32 identical elements will hash to either 0 or -1. Similarly
> various permutations or multiples of that array length will cause it
> to perform badly.
>
> The multiply-by-m algorithm doesn't have that weakness, provided m is
> chosen carefully. There are a couple of qualities a good algorithm
> should possess:
>
> 1). The bits from the individual element hash values should be
> distributed evenly so that no 2 different hash values would result in
> the same contribution to the final value. This is easy to achieve -
> just make sure that m is odd.
>
> 2). The way that each element's hash value bits are distributed should
> be different from the way that every other element's hash value bits
> are distributed. m=31 achieves this pretty well, although there are
> plenty of other equally valid choices.
>
> Regards,
> Dean
>
Hi Dean,

In my comment yesterday, I included a simple function that would
allow us to leverage our current hash functions mixing process to
scramble the bits effectively and retaining the maximum amount of
information in the hash.

Regards,
Ken

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message KaiGai Kohei 2010-11-04 12:49:07 contrib: auth_delay module
Previous Message Thom Brown 2010-11-04 12:05:01 Alter column to type serial