Re: Hash support for arrays

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, marcin mank <marcin(dot)mank(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash support for arrays
Date: 2010-11-02 20:07:47
Message-ID: AANLkTimodWdPAfdHnbhw5iRV66GXs-5Jhu=rQVf71S5U@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 2, 2010 at 12:34 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>
>>>> The goal is to make those hard to predict, though.
>>>
>>> Really?  I think "I don't understand when this fails" isn't
>>> obviously better than being able to predict when it fails ...
>>
>> Isn't that the whole point of hash functions?  Collisions are
>> inevitable, but you want them to be unpredictable.
>
> Are you sure you're not confusing attributes of a good random number
> generator with hash generation?  (They have much in common, but I've
> only seen concern with "hard to predict" when working on random
> number algorithms, as for financial audits or jury selection.)
>
>> If you want a hash function with predictable collision behavior,
>> just hash the first element.  It'll be highly predictable AND
>> wicked fast.
>
> You're confusing "unpredictable" with "widely distributed", I think.
> There's no reason that the hash value of an integer of the same size
> as the hash shouldn't be the integer itself, for example.  It's hard
> to get more predictable than that, yet it works well for hash
> lookups.

Well, no, not really. For example, you may have a hash table with N
buckets and values of the form N+k, 2N+k, 3N+k, ..., in which case
everything will collide.

If you do some mixing of the bits, that case is far less likely to
occur in practice.

See for example http://www.azillionmonkeys.com/qed/hash.html

> I know that for a hash of a character string, the total = (total *
> 31) + nextcharacter has been shown to be effective.  That's hardly
> random or hard to predict, but it does tend to spread out the hash
> values well.  Whether it is as good for arrays, I don't know.  It
> seems like a reasonable place to start, though, since a character
> string is rather like an array of characters.

That was my thought.

>>> What concerns me about that is that it tends to push the bits to
>>> the left --- I think the effects of the earlier inputs are going
>>> to overflow out as you incorporate a lot of newer inputs.  What
>>> you want is a scheme where every input item affects all the bits
>>> of the final result about equally, and my gut feeling is this
>>> doesn't provide that.
>>
>> I don't think so.  That would be a problem if you multiplied by an
>> even number.  You can see that you don't get dups if you just add
>> in the same value over and over:
>
> Yeah, that 1 in the low order position ensures that the impact of
> any value which is ever added into the total is never entirely lost.

Right...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
