Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "jian he" <jian(dot)universality(at)gmail(dot)com>
Subject: Re: Do we want a hashset type?
Date: 2023-06-12 20:36:01
Message-ID: f342049a-efb7-45e4-b161-10b2eb63d045@app.fastmail.com
Lists: pgsql-hackers

On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:

Oh, sorry, I completely misread the meaning of the columns.

> lookup3:
>
> Small key speed test - 4-byte keys - 30.17 cycles/hash
> Small key speed test - 8-byte keys - 31.00 cycles/hash
> Small key speed test - 16-byte keys - 49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test - 4-byte keys - 29.00 cycles/hash
> Small key speed test - 8-byte keys - 29.58 cycles/hash
> Small key speed test - 16-byte keys - 37.00 cycles/hash

The winner of the "Small key speed test" competition seems to be:

ahash64 "ahash 64bit":
Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash

Looks like it's a popular one, e.g. it's the default hasher of Rust's hashbrown crate, the hash table implementation behind std::collections::HashSet.

Another notable property of ahash64 is that it's DoS-resistant,
though that isn't crucial for our use case, since we exclusively
target auto-generated primary keys that aren't influenced by end-users.

> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with tiny amount of
> free space (say 10% and not ~%50).

By "optimizing for space" I meant trying to find some alternative or
intermediate data structure that is more likely to be compressible,
like your idea of sorting the data.

What I wondered was whether the on-disk format would be affected by
the choice of hash function. I guess it wouldn't, if the hashset is
created by iterating through the on-disk format and adding the
elements one by one. But maybe something more advanced could be done,
where conversion between the on-disk format and the in-memory format
doesn't require naively iterating through all elements, i.e. something
cheaper than O(n). No idea what that mechanism would be though.
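One hypothetical mechanism (just a sketch, all names and the layout are
mine, not an actual proposal): if the on-disk format were the hash table
itself - same hash function, same bucket layout as in memory - then
"loading" could be a cast plus sanity checks rather than a per-element
rebuild:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical flat layout: a fixed header followed immediately by the
 * bucket array.  If the in-memory representation uses the same layout
 * and the same hash function, attaching to serialized bytes needs no
 * per-element rehashing.
 */
typedef struct
{
	uint32_t	magic;			/* format identifier, "HSET" */
	uint32_t	hashfn_id;		/* which hash function built the table */
	uint32_t	nbuckets;
	uint32_t	nelements;
	/* int64 buckets[nbuckets] follow; 0 marks an empty slot */
} hashset_disk;

static int64_t *
hashset_buckets(hashset_disk *set)
{
	return (int64_t *) (set + 1);
}

/* O(1) "load": reuse the serialized bytes directly, after sanity checks. */
static hashset_disk *
hashset_attach(void *buf, size_t len)
{
	hashset_disk *set = (hashset_disk *) buf;

	if (len < sizeof(hashset_disk) ||
		set->magic != 0x48534554)
		return NULL;
	return set;
}
```

The obvious downside is that the on-disk bytes are then exactly as
incompressible as the in-memory table, which is the opposite of the
space optimization above - so the two goals may simply be in tension.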

> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store and ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.

Sounds good to me.

Smart idea to include the hash function algorithm ID in the set,
to allow implementing a different one in the future!
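As a sketch of how that versioning could look (the enum values, the toy
hash, and all names here are hypothetical - the real lookup3 lives in
hashfn.c as hash_bytes()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical algorithm IDs stored with each set, so the format can
 * switch hash functions later without breaking existing sets.
 */
typedef enum
{
	HASHSET_HASH_LOOKUP3 = 1,	/* hash_bytes() in hashfn.c */
	HASHSET_HASH_XXH3LOW = 2	/* reserved for a possible future switch */
} hashset_hashfn;

/* Stand-in for lookup3; a real build would call hash_bytes() instead. */
static uint32_t
toy_lookup3(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint32_t	h = 0x9e3779b9;

	for (size_t i = 0; i < len; i++)
		h = (h ^ p[i]) * 0x85ebca6b;
	return h;
}

/* Dispatch on the ID stored with the set. */
static uint32_t
hashset_hash(hashset_hashfn id, const void *key, size_t len)
{
	switch (id)
	{
		case HASHSET_HASH_LOOKUP3:
			return toy_lookup3(key, len);
		default:
			return 0;			/* unknown algorithm: report an error in real code */
	}
}
```

The point is just that every lookup goes through the stored ID, so a set
written with lookup3 keeps hashing with lookup3 even after a newer
algorithm becomes the default.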

/Joel
