From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: John Naylor <johncnaylorls(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Gurjeet Singh <gurjeet(at)singh(dot)im>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Change GUC hashtable to use simplehash?
Date: 2023-11-22 22:34:32
Message-ID: 20231122223432.lywt4yz2bn7tlp27@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-11-22 16:27:56 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2023-11-22 15:56:21 -0500, Tom Lane wrote:
> >> GUC names are just about always short, though, so I'm not sure you've
> >> made your point?
>
> > With short I meant <= 6 characters (32 / 5 = 6.x). After that you're
> > overwriting bits that you previously set, without dispersing the "overwritten"
> > bits throughout the hash state.
>
> I'm less than convinced about the "overwrite" part:
>
> + /* Merge into hash ... not very bright, but it needn't be */
> + result = pg_rotate_left32(result, 5);
> + result ^= (uint32) ch;
>
> Rotating a 32-bit value 5 bits at a time doesn't result in successive
> characters lining up exactly, and even once they do, XOR is not
> "overwrite".

I didn't know what word to use, hence the air quotes. Yes, xor doesn't just
overwrite the state with the right-hand side, but it affects the data only on
a per-bit basis, which can easily be cancelled out.

My understanding of writing hash functions is that every added bit mixed in
should have a ~50% chance of causing each other bit to flip. The proposed
function obviously doesn't get there.

It's worth noting that the limited range of the input values means that
there's a lot of bias toward some bits being set ('a' to 'z' all start with
0b011).
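
Just to illustrate how easily bits cancel out, assuming a zero initial state
(pg_rotate_left32() is the existing helper from port/pg_bitutils.h):

static uint32
weak_hash(const char *name)
{
    uint32 result = 0;

    while (*name)
    {
        result = pg_rotate_left32(result, 5);
        result ^= (uint32) *name++;
    }

    return result;
}

/*
 * 'a' ^ 'b' == 0x03; the rotate moves those two bits to positions 5/6,
 * where the next character's xor can cancel them again:
 *
 *   weak_hash("aA") == weak_hash("b!") == 0xC61
 *
 * Not realistic GUC names, obviously, but the same kind of cancellation
 * is possible for longer inputs too.
 */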

> I'm pretty dubious that we need something better than this.

Well, we know that the current attempt at a dedicated hash function for this
does result in a substantial number of conflicts. And such cases are hard to
understand when you hit them, so I think it's better to avoid exposing
ourselves to such dangers without a distinct need.

And I don't really see the need here to risk it, even if we are somewhat
confident it's fine.

If, which I mildly doubt, we can't afford to call murmurhash32 for every
character, we could just call it once for every 32/5 input characters. Or we
could load up to 8 characters into a 64-bit integer and call murmurhash64.

Something roughly like

uint64 result = 0;

while (*name)
{
    uint64 value = 0;

    for (int i = 0; i < 8 && *name; i++)
    {
        char ch = *name++;

        value = (value << 8) | (uint8) ch;
    }

    result = hash_combine64(result, murmurhash64(value));
}

The hash_combine use isn't quite right either; we should use the full
accumulator state of a proper hash function. But it seems very unlikely to
matter here.

The fact that string_hash() is slow due to the strlen(), which causes us to
process the input twice and which is optimized to handle very long strings
that string_hash() typically doesn't encounter, seems problematic far beyond
this case. We use string_hash() in a *lot* of places, and that strlen()
regularly shows up in profiles. We should fix that.
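
E.g. a one-pass variant along these lines (the function name is made up;
murmurhash32() and hash_combine() are the existing helpers from
common/hashfn.h) would avoid scanning the input twice:

static inline uint32
cstring_hash_onepass(const char *s)
{
    uint32 h = 0;

    while (*s)
    {
        uint32 chunk = 0;

        /* accumulate up to four bytes, stopping at the terminator */
        for (int i = 0; i < 4 && *s; i++)
            chunk = (chunk << 8) | (uint8) *s++;

        h = hash_combine(h, murmurhash32(chunk));
    }

    return h;
}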

The various hash functions being external functions also shows up in a bunch
of profiles. It's particularly ridiculous for cases like tag_hash(),
where the caller typically knows the length, but calls a routine in a
different translation unit, which obviously can't be optimized for a specific
length.

I think we ought to adjust our APIs around this:

1) The accumulator state of the hash functions should be exposed, so one can
accumulate values into the hash state, without reducing the internal state
to a single 32/64 bit variable.

2) For callers that know the length of data, we should use a static inline
hash function, rather than an external function call. This should include
special cased inline functions for adding 32/64bit of data to the hash
state.

Perhaps with a bit of logic to *not* use the inline version if the hashed
input is long (and thus the call overhead doesn't matter). Something like
if (__builtin_constant_p(len) && len < 128)
    /* call inline implementation */
else
    /* call out of line implementation, not worth the code bloat */

We know that hash functions should be split into init/process data/finish
steps, as e.g. evidenced by pg_crc.h/pg_crc32c.h.
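
For illustration, such an API could look roughly like this (all names are
placeholders and the mixing step is just a stand-in; murmurhash32() is from
common/hashfn.h, pg_rotate_left32() from port/pg_bitutils.h):

typedef struct hash32_state
{
    uint32 accum;   /* deliberately exposed accumulator state */
} hash32_state;

static inline void
hash32_init(hash32_state *hs)
{
    hs->accum = 0;
}

static inline void
hash32_process_byte(hash32_state *hs, uint8 ch)
{
    /* stand-in mixing step; a real one would be chosen more carefully */
    hs->accum = pg_rotate_left32(hs->accum, 7) ^ murmurhash32(ch);
}

static inline uint32
hash32_finish(hash32_state *hs)
{
    /* final avalanche */
    return murmurhash32(hs->accum);
}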

With something like that, you could write a function that lowercases
characters inline without incurring unnecessary overhead.

hash32_state hs;

hash32_init(&hs);

while (*name)
{
    char ch = *name++;

    /* crappy lowercase for this situation */
    ch |= 0x20;

    hash32_process_byte(&hs, ch);
}

return hash32_finish(&hs);

Perhaps with some additional optimization for processing the input string in
32/64 bit quantities.
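
E.g. the byte-wise loop above could process eight bytes at a time, along
these lines (just a sketch: it assumes the buffer can safely be read in
8-byte chunks past the terminator, e.g. due to padded allocation, and
hash64_process() is a hypothetical 64-bit variant of hash32_process_byte()):

static inline bool
chunk_has_zero_byte(uint64 v)
{
    /* standard bit trick: nonzero iff some byte of v is zero */
    return ((v - UINT64CONST(0x0101010101010101)) & ~v &
            UINT64CONST(0x8080808080808080)) != 0;
}

...

while (true)
{
    uint64 chunk;

    memcpy(&chunk, name, sizeof(chunk));

    if (chunk_has_zero_byte(chunk))
        break;                  /* NUL in this chunk, finish byte-wise */

    /* same crappy lowercasing, eight bytes at a time */
    chunk |= UINT64CONST(0x2020202020202020);

    hash64_process(&hs, chunk);
    name += sizeof(chunk);
}

/* handle the remaining < 8 bytes with the byte-wise loop */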

Greetings,

Andres Freund
