Re: Hash support for arrays

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: marcin mank <marcin(dot)mank(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash support for arrays
Date: 2010-11-02 19:04:04
Message-ID: AANLkTinaX3DgEx4U8dyuR4eJs5YHEiH3x1iD2zHGD0Xj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> marcin mank <marcin(dot)mank(at)gmail(dot)com> writes:
>>>> This would make the hash the same for arrays with elements 32 apart swapped.
>>>
>>> Well, there are *always* going to be cases where you get the same hash
>>> value for two different inputs; it's unavoidable given that you have to
>>> combine N 32-bit hash values into only one 32-bit output.
>
>> Sure.  The goal is to make those hard to predict, though.
>
> Really?  I think "I don't understand when this fails" isn't obviously
> better than being able to predict when it fails ...

Isn't that the whole point of hash functions? Collisions are
inevitable, but you want them to be unpredictable. If you want a hash
function with predictable collision behavior, just has the first
element. It'll be highly predictable AND wicked fast.

>> I think
>> "multiply by 31 and add the next value" is a fairly standard way of
>> getting that behavior.  It mixes up the bits a lot more than just
>> left-shifting by a variable offset.
>
> What concerns me about that is that it tends to push the bits to the
> left --- I think the effects of the earlier inputs are going to overflow
> out as you incorporate a lot of newer inputs.  What you want is a scheme
> where every input item affects all the bits of the final result about
> equally, and my gut feeling is this doesn't provide that.

I don't think so. That would be a problem if you multiplied by an
even number. You can see that you don't get dups if you just add in
the same value over and over:

#include <stdio.h>

int
main(int argc, char **argv)
{
unsigned hv = 1;
unsigned i;

for (i = 0; i < 1000000; ++i)
{
hv = hv * 31 + 1;
printf("%08x\n", hv);
}
return 0;
}

No dups. Also, if you change that first hv = 1 line to hv = 2 and
rerun, that sequence also has no dups, and no collisions with the hv =
1 sequence.

> I'd be happier about this approach if there were some actual theory
> behind it ... maybe there's some analysis out there, but the one link
> that was given was just about entirely unconvincing.

I think it's from Knuth, though unfortunately I don't have a copy to
check. There are probably better algorithms out there, but this one's
pretty simple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2010-11-02 19:34:26 Re: Hash support for arrays
Previous Message Alvaro Herrera 2010-11-02 18:34:36 Re: Hash support for arrays