Re: Change GUC hashtable to use simplehash?

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Junwang Zhao <zhjwpku(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <gurjeet(at)singh(dot)im>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Change GUC hashtable to use simplehash?
Date: 2024-01-22 02:03:38
Message-ID: CANWCAZbTUk2LOyhsFo33gjLyLAHZ7ucXCi5K9u=+PtnTShDKtw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> fasthash_init(&hs, sizeof(Datum), kind);
> fasthash_accum(&hs, (char *) &value, sizeof(Datum));
> return fasthash_final32(&hs, 0);

It occurred to me that it's strange to have two places that length can
be passed. That was a side effect of the original, which used length
to both know how many bytes to read, and to modify the internal seed.
With the incremental API, it doesn't make sense to pass the length (or
a dummy macro) up front -- with a compile-time fixed length, it can't
possibly break a tie, so it's just noise.

0001 removes the length from initialization in the incremental
interface. The standalone functions use length directly the same as
before, but after initialization. Thoughts?

Also, the fasthash_accum call is a bit verbose, because it's often
used in a loop with varlen input. For register-sized values, I think
it's simpler to say this, as done in the search path cache, so maybe a
comment to that effect would be helpful:

hs.accum = value;
fasthash_combine(&hs);

I noticed that we already have a more recent, stronger 64-bit mixer
than murmur64: splitmix64, in pg_prng.c. We could put that, as well as
a better 4-byte mixer [1] in hashfn_unstable.h, for in-memory use.
Maybe with names like "hash_4bytes" etc. so it's not tied to a
specific implementation. I see one simplehash case that can use it,
even if the resowner hash table gets rid of it.

0002 and 0003 use fasthash for dynahash and GUC hash, respectively.
These cannot use the existing cstring hashing directly because of
truncation and case-folding, respectively. (Some simplehash uses can,
but that can come later)

On Sun, Jan 21, 2024 at 8:06 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> After having stepped away from this work for a couple weeks and
> returning to it, I think the comments and/or naming could be more
> clear. We first use the result of haszero64() as a boolean to break out
> of the loop, but then later use it in a more interesting way to count
> the number of remaining bytes.
>
> Perhaps you can take the comment out of the loop and just describe the
> algorithm we're using, and make a note that we have to byteswap first.
> "Indeterminate" could be explained briefly as well.

v15-0004 is a stab at that. As an idea, it also renames zero_bytes_le
to zero_byte_low to reflect the effect better. There might be some
other comment edits needed to explain usage, so I plan to hold on to
this for later. Let me know what you think.

[1] Examples of both in
https://www.boost.org/doc/libs/1_84_0/boost/container_hash/detail/hash_mix.hpp

Attachment Content-Type Size
v15-0003-Use-fasthash-for-guc_name_hash.patch text/x-patch 1.8 KB
v15-0004-WIP-comment-edits.patch text/x-patch 3.0 KB
v15-0001-Initialization-of-incremental-hashing-no-longer-.patch text/x-patch 2.8 KB
v15-0002-Use-fasthash-for-dynahash-s-default-string-hash.patch text/x-patch 4.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-01-22 02:10:09 Re: MERGE ... WHEN NOT MATCHED BY SOURCE
Previous Message Peter Smith 2024-01-22 01:57:02 Re: Support TZ format code in to_timestamp()