Re: Combining hash values

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Combining hash values
Date: 2016-08-01 00:52:06
Message-ID: CAEepm=3Tp65SUDYuOugTxrMx+tFVsdLrDL6AxbO0w2vsv3c1-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 1, 2016 at 4:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> 2. I suspect that this algorithm for combining hashes is weak, and
>> could amplify weaknesses in the hash functions feeding it.
>
> Very possibly, but ...

Concrete example: suppose a clever data type author defines a perfect
hash function that maps values to small integers. In terms of hash
collisions, this performs optimally in a single column hash join, agg,
index etc by definition, and seems like an entirely reasonable thing
to want to do, but performs terribly with Postgres's hash combination
algorithm as soon as you use several columns. The stuffing is all up
one end of the cushion, which is OK for a perfect hash on its own, but
we do very little to spread it around when combining. For two columns
that hash to 8 bit integers, we map 16 bits of information to only 9
bits:

postgres=# select count(*) as distinct_outputs,
postgres-# min(collisions),
postgres-# max(collisions),
postgres-# avg(collisions),
postgres-# stddev(collisions)
postgres-# from (
postgres(# select s1.i # (s2.i << 1),
postgres(# count(*) as collisions
postgres(# from generate_series(0, 255) as s1(i)
postgres(# cross join generate_series(0, 255) as s2(i)
postgres(# group by 1
postgres(# ) ss;
┌──────────────────┬─────┬─────┬──────────────────────┬────────┐
│ distinct_outputs │ min │ max │ avg │ stddev │
├──────────────────┼─────┼─────┼──────────────────────┼────────┤
│ 512 │ 128 │ 128 │ 128.0000000000000000 │ 0 │
└──────────────────┴─────┴─────┴──────────────────────┴────────┘
(1 row)

The Boost combiner does better. If I have this right:

postgres=# create or replace function hash_combine(seed bigint, hash bigint)
postgres-# returns bigint as
postgres-# $$
postgres$# select (seed # (hash + 2654435769 + (seed << 6) + (seed >> 2)))
postgres$# % 4294967296;
postgres$# $$ language sql;
CREATE FUNCTION
postgres=#
postgres=# select count(*) as distinct_outputs,
postgres-# min(collisions),
postgres-# max(collisions),
postgres-# avg(collisions),
postgres-# stddev(collisions)
postgres-# from (
postgres(# select hash_combine(hash_combine(0, s1.i), s2.i),
postgres(# count(*) as collisions
postgres(# from generate_series(0, 255) as s1(i)
postgres(# cross join generate_series(0, 255) as s2(i)
postgres(# group by 1
postgres(# ) ss;
┌──────────────────┬─────┬─────┬────────────────────┬────────────────────────┐
│ distinct_outputs │ min │ max │ avg │ stddev │
├──────────────────┼─────┼─────┼────────────────────┼────────────────────────┤
│ 16583 │ 1 │ 9 │ 3.9519990351564856 │ 0.70952439864334926106 │
└──────────────────┴─────┴─────┴────────────────────┴────────────────────────┘
(1 row)

Not as good as I hoped (maybe there is a bug in my query?), but not a
128 car pile-up.

There are probably other kinds of specialised (or poorly written) hash
functions that will work well or acceptably on their own but not so
well when combined without good mixing. Like Boost, we are dealing
with hashes produced by arbitrary hash functions that can be supplied
by user code, not just by the high quality built-in function from Bob
Jenkins.

>> Compare
>> Boost's hash_combine, which does some more work before XORing:
>
>> seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
>
> I can't help being reminded of Knuth's story about he tried to invent
> the world's best random number generator, and was disappointed when
> it almost immediately converged to a short repeating sequence. If
> there's any actual theoretical basis to the above, I'd be interested
> to see it. But as-is, the use of addition rather than XOR looks fishy,
> and the way the old seed is shifted around looks more likely to result
> in cancellation than anything else.

I provided boost::hash_combine as an example only because it's widely
known and used in used in C++ circles and familiar to me, but so far I
have only taken it on authority that it works. Here's what I managed
to dig up on it this morning: According to C++ N3876[1], a proposal
to add std::hash_combine (but not a specific algorithm) to the C++
standard library, Boost's algorithm comes from a paper called "Methods
for Identifying Versioned and Plagiarised Documents (2003)"[2], where
you can see this recipe for shifting and summing as equation 1, with
some discussion and pointers to other papers. But I see in N3876's
feedback (see the comment from Jeffrey Jasskin) that Bob Jenkins'
mixer (or at least an aspect of it: not requiring intermediate state
to fit in the final output size) is discussed as a potential
implementation. We already have that in our source tree... perhaps we
should use it.

>> That constant approximates the golden ratio (as a fraction of the 32
>> bit hash space), and it also appears in our hash_any and hash_uint32
>> functions.
>
> I think it's just somebody's idea of a magic random number. Your link
>> [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
> provides some reasons to think it might be a good choice for a very
> specific application, but this is not that application --- in particular,
> it doesn't involve multiplying by that constant.

Yeah. About its use in his 1997 algorithm, Bob Jenkins says:

"The golden ratio really is an arbitrary value. Its purpose is to
avoid mapping all zeros to all zeros."

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.2680

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-08-01 04:07:25 Re: Broken order-of-operations in parallel query latch manipulation
Previous Message Tom Lane 2016-07-31 21:57:12 Re: Slowness of extended protocol