Re: Hash Functions

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, amul sul <sulamul(at)gmail(dot)com>
Subject: Re: Hash Functions
Date: 2017-05-13 23:08:29
Message-ID: 20170513230829.rw5oocydnpvlqyfh@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-05-13 10:29:09 -0400, Robert Haas wrote:
> On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > Can we think of defining separate portable hash functions which can be
> > used for the purpose of hash partitioning?
>
> I think that would be a good idea. I think it shouldn't even be that
> hard. By data type:
>
> - Integers. We'd need to make sure that we get the same results for
> the same value on big-endian and little-endian hardware, and that
> performance is good on both systems. That seems doable.
>
> - Floats. There may be different representations in use on different
> hardware, which could be a problem. Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no". If every bit
> of hardware we are likely to find uses basically the same
> representation of the same float value, then this shouldn't be hard.
> (Also, even if this turns out to be hard for floats, using a float as
> a partitioning key would be a surprising choice because the default
> output representation isn't even unambiguous; you need
> extra_float_digits for that.)
>
> - Strings. There's basically only one representation for a string.
> If we assume that the hash code only needs to be portable across
> hardware and not across encodings, a position for which I already
> argued upthread, then I think this should be manageable.
>
> - Everything Else. Basically, everything else is just a composite of
> that stuff, I think.

I seriously doubt that's true. A lot of more complex types have
internal alignment padding and such. Consider e.g. something like
jsonb, hstore, or postgis types - you *can* convert them to something
that's unambiguous, but it's going to be fairly expensive. Essentially
you'd have to something like calling the output function, and then
hashing the result of that. And hash-partitioning is particularly
interesting for e.g. large amounts of points in a geospatial scenario,
because other types of partitioning are quite hard to maintain.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2017-05-14 03:03:39 Event triggers + table partitioning cause server crash in current master
Previous Message Jeff Davis 2017-05-13 23:01:17 Re: Hash Functions