Re: Hash support for arrays

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "marcin mank" <marcin(dot)mank(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash support for arrays
Date: 2010-11-02 19:34:26
Message-ID: 4CD0217202000025000371A2@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:

>>> The goal is to make those hard to predict, though.
>>
>> Really? I think "I don't understand when this fails" isn't
>> obviously better than being able to predict when it fails ...
>
> Isn't that the whole point of hash functions? Collisions are
> inevitable, but you want them to be unpredictable.

Are you sure you're not confusing attributes of a good random number
generator with hash generation? (They have much in common, but I've
only seen concern with "hard to predict" when working on random
number algorithms, as for financial audits or jury selection.)

> If you want a hash function with predictable collision behavior,
> just has the first element. It'll be highly predictable AND
> wicked fast.

You're confusing "unpredictable" with "widely distributed", I think.
There's no reason that the hash value of an integer of the same size
as the hash shouldn't be the integer itself, for example. It's hard
to get more predictable than that, yet it works well for hash
lookups.

I know that for a hash of a character string, the total = (total *
31) + nextcharacter has been shown to be effective. That's hardly
random or hard to predict, but it does tend to spread out the hash
values well. Whether it is as good for arrays, I don't know. It
seems like a reasonable place to start, though, since a character
string is rather like an array of characters.

>> What concerns me about that is that it tends to push the bits to
>> the left --- I think the effects of the earlier inputs are going
>> to overflow out as you incorporate a lot of newer inputs. What
>> you want is a scheme where every input item affects all the bits
>> of the final result about equally, and my gut feeling is this
>> doesn't provide that.
>
> I don't think so. That would be a problem if you multiplied by an
> even number. You can see that you don't get dups if you just add
> in the same value over and over:

Yeah, that 1 in the low order position ensures that the impact of
any value which is ever added into the total is never entirely lost.

-Kevin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-11-02 20:00:17 Re: ALTER TYPE recursion to typed tables
Previous Message Robert Haas 2010-11-02 19:04:04 Re: Hash support for arrays