From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Do we want a hashset type?
Date: 2023-06-12 19:58:56
Message-ID: d6bf1a75-e86b-6563-1478-6345d804cdda@enterprisedb.com
Lists: pgsql-hackers

On 6/12/23 19:34, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>>> 1) better hash table implementation
>
> I noticed src/include/common/hashfn.h provides an implementation
> of the Jenkins/lookup3 hash function, and thought maybe
> we could simply use it in hashset?
>
> However, I noticed that according to SMHasher [1],
> Jenkins/lookup3 has some quality problems:
> "UB, 28% bias, collisions, 30% distr, BIC"
>
> Not sure if that's true or maybe not a problem in the PostgreSQL implementation?
>
> According to SMHasher, the two fastest 32/64-bit hash functions
> for non-cryptographic purposes without any quality problems
> that are also portable seems to be these two:
>
> wyhash v4.1 (64-bit) [2]
> MiB/sec: 22513.04
> cycl./hash: 29.01
> size: 474
>
> xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
> MiB/sec: 20722.94
> cycl./hash: 30.26
> size: 756
>
> [1] https://github.com/rurban/smhasher
> [2] https://github.com/wangyi-fudan/wyhash
> [3] https://github.com/Cyan4973/xxHash
>

But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:

lookup3:

Small key speed test - 4-byte keys - 30.17 cycles/hash
Small key speed test - 8-byte keys - 31.00 cycles/hash
Small key speed test - 16-byte keys - 49.00 cycles/hash

xxh3low:

Small key speed test - 4-byte keys - 29.00 cycles/hash
Small key speed test - 8-byte keys - 29.58 cycles/hash
Small key speed test - 16-byte keys - 37.00 cycles/hash

But you can try doing some measurements, of course. Or just do profiling
to see how much time we spend in the hash function - I'd bet it's a
pretty tiny fraction of the total time.

As for the "quality" issues - it's the same algorithm in Postgres, so it
has the same issues. I don't know if that has a measurable impact,
though. I'd guess it does not, particularly for "reasonably small" sets.

>>>> 5) more efficient storage format, with versioning etc.
>> I think the main question is whether to serialize the hash table as is,
>> or compact it in some way. The current code just uses the same thing for
>> both cases - on-disk format and in-memory representation (aggstate).
>> That's simple, but it also means the on-disk value is likely not well
>> compressible (because it's ~50% random data).
>>
>> I've thought about serializing just the values (as a simple array), but
>> that defeats the whole purpose of fast membership checks. I have two ideas:
>>
>> a) sort the data, and use binary search for this compact variant (and
>> then expand it into "full" hash table if needed)
>>
>> b) use a more compact hash table (with load factor much closer to 1.0)
>>
>> Not sure which of these options is the right one; each has a cost for
>> converting between formats (but a large on-disk value is not free either).
>>
>> That's roughly what I did for the tdigest extension.
>
> Is the choice of hash function (and its in-memory representation)
> independent of the on-disk format question, i.e. could we work
> on experimenting and evaluating different hash functions first,
> to optimise for speed and quality, and then when done, proceed
> and optimise for space, or are the two intertwined somehow?
>

Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with a tiny amount of
free space (say 10%, not ~50%).
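Roughly something like this (untested sketch - hashset_t and all the
helper names are placeholders I made up, not code from the patch):

/*
 * Rebuild the set into a table sized for ~90% load before serializing,
 * so the on-disk value carries less (incompressible) free space.
 * hashset_allocate/hashset_add/HASHSET_SLOT_USED are hypothetical.
 */
static hashset_t *
hashset_compact(const hashset_t *set)
{
    /* smallest capacity keeping load factor <= 0.9 */
    int32       newcapacity = (set->nelements * 10) / 9 + 1;
    hashset_t  *compact = hashset_allocate(newcapacity);

    /* reinsert all elements into the smaller table */
    for (int32 i = 0; i < set->capacity; i++)
    {
        if (HASHSET_SLOT_USED(set, i))
            hashset_add(compact, set->values[i]);
    }

    return compact;
}

The inverse (expanding back to ~50% load) would happen when the value
is loaded and we actually need to do membership checks or add elements.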

My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store and ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.
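To be concrete, something like this (just a sketch - the struct layout
is made up; only hash_bytes() is actual code, from common/hashfn.h):

#include "postgres.h"
#include "common/hashfn.h"

#define HASHSET_HASHFN_LOOKUP3    1    /* hash_bytes(), i.e. lookup3 */

typedef struct hashset_t
{
    int32    vl_len_;      /* varlena header */
    int32    version;      /* on-disk format version */
    int32    hashfn_id;    /* which hash function built the table */
    int32    capacity;     /* number of slots */
    int32    nelements;    /* number of used slots */
    /* bitmap + values follow */
} hashset_t;

static uint32
hashset_hash(const hashset_t *set, const unsigned char *key, int keylen)
{
    if (set->hashfn_id == HASHSET_HASHFN_LOOKUP3)
        return hash_bytes(key, keylen);

    elog(ERROR, "unknown hash function ID: %d", set->hashfn_id);
    return 0;                   /* keep compiler quiet */
}

That way a value built with lookup3 stays readable even if we later add
wyhash or xxh3 under a new ID.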

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
