Re: Change GUC hashtable to use simplehash?

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Gurjeet Singh <gurjeet(at)singh(dot)im>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Change GUC hashtable to use simplehash?
Date: 2023-11-29 13:31:21
Message-ID: CANWCAZb_17xdPP7v1_mXXg5TVbWkDzkP_WDgbCWpqUvz_VZ+OQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is a rough start with Andres's earlier ideas, to get
something concrete out there.

I took a look around at other implementations a bit. Many modern hash
functions use MUM-style hashing, which typically uses 128-bit
arithmetic. Even if they already have an incremental interface and
have a compatible license, it seems a bit too much work to adopt just
for a couple string use cases. Might be useful elsewhere, though, but
that's off topic.

However, I did find a couple hash functions that are much simpler to
adapt to a bytewise interface, pass SMHasher, and are decently fast on
short inputs:

- fast-hash, MIT licensed, and apparently has some use in software [1]
- MX3, CC0 license (looking around, seems controversial for a code
license, so didn't go further). [2] Seems to be a for-fun project, but
the accompanying articles are very informative on how to develop these
things.

After wacking fast-hash around, it doesn't really resemble the
original much, and if for some reason we went as far as switching out
the mixing/final functions, it may as well be called completely
original work. I thought it best to start with something whose mixing
behavior passes SMHasher, and hopefully preserve that property.

Note that the combining and final steps share most of their arithmetic
operations. This may have been done on purpose to minimize binary
size, but I didn't check. Also, it incorporates input length into the
calculation. Since we don't know the length of C strings up front, I
threw that out for now. It'd be possible to track the length as we go
and incorporate something into the final step. The hard part is
verifying it hasn't lost any quality.

v5-0001 puts fash-hash as-is into a new header, named in a way to
convey in-memory use e.g. hash tables.

v5-0002 does the minimal to allow dynash to use this for string_hash,
inlined but still calling strlen.

v5-0003 shows one way to do a incremental interface. It might be okay
for simplehash with fixed length keys, but seems awkward for strings.

v5-0004 shows a bytewise incremental interface, with implementations
for dynahash (getting rid of strlen) and guc hash.

[1] https://code.google.com/archive/p/fast-hash/
[2] https://github.com/jonmaiga/mx3

Attachment Content-Type Size
v5-0004-Add-bytewise-interface-for-incrementing-the-hash-.patch text/x-patch 3.8 KB
v5-0003-Add-incremental-interface-to-fasthash-and-use-it-.patch text/x-patch 3.6 KB
v5-0001-Vendor-fasthash.patch text/x-patch 3.3 KB
v5-0002-Inline-string_hash-and-call-out-to-fasthash-inste.patch text/x-patch 4.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-11-29 13:34:24 Re: Python installation selection in Meson
Previous Message Tomas Vondra 2023-11-29 13:28:47 Re: logical decoding and replication of sequences, take 2