Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "jian he" <jian(dot)universality(at)gmail(dot)com>
Cc: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, "Tom Dunstan" <pgsql(at)tomd(dot)cc>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-16 10:24:43
Message-ID: 242f23fb-3780-4e04-8fee-335fef8a6ae6@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

New patch attached:

Add customizable params to int4hashset() and collision count function

This commit enhances int4hashset() by introducing adjustable capacity,
load, and growth factors, providing flexibility for performance optimization.

Also added is a new function, hashset_collisions(), to report collision
counts, aiding in performance tuning.

Aggregate functions are renamed to hashset_agg() for consistency with
array_agg() and range_agg().

A new test file, test/sql/benchmark.sql, is added for evaluating the
performance of hash functions. It's not run automatically by
make installcheck.

The adjustable parameters and the naive hash function are useful for testing
and performance comparison. However, to keep things simple and streamlined
for users, these features are likely to be removed in the final release,
emphasizing the use of well-optimized default settings.

SQL-function indentation is also adjusted to align with the PostgreSQL
source repo, improving readability.

In the benchmark results below, it was a bit surprising the naive hash
function had no collisions, but that only held true when the input
elements were sequential integers. When tested with random integers,
all three hash functions caused collisions.

Timing results not statistical significant, the purpose is just to
give an idea of the execution times.

*** Elements in sequence 1..100000
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:23: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:23: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:23: NOTICE: hashset_collisions: 31195
DO
Time: 1342.564 ms (00:01.343)
- Testing Murmurhash32
psql:test/sql/benchmark.sql:40: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:40: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:40: NOTICE: hashset_collisions: 30879
DO
Time: 1297.823 ms (00:01.298)
- Testing naive hash function
psql:test/sql/benchmark.sql:57: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:57: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:57: NOTICE: hashset_collisions: 0
DO
Time: 1400.936 ms (00:01.401)
*** Testing 100000 random ints
setseed
---------

(1 row)

Time: 3.591 ms
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:77: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:77: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:77: NOTICE: hashset_collisions: 30919
DO
Time: 1415.497 ms (00:01.415)
setseed
---------

(1 row)

Time: 1.282 ms
- Testing Murmurhash32
psql:test/sql/benchmark.sql:95: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:95: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:95: NOTICE: hashset_collisions: 30812
DO
Time: 2079.202 ms (00:02.079)
setseed
---------

(1 row)

Time: 0.122 ms
- Testing naive hash function
psql:test/sql/benchmark.sql:113: NOTICE: hashset_count: 100000
psql:test/sql/benchmark.sql:113: NOTICE: hashset_capacity: 262144
psql:test/sql/benchmark.sql:113: NOTICE: hashset_collisions: 30822
DO
Time: 1613.965 ms (00:01.614)

/Joel

Attachment Content-Type Size
hashset-0.0.1-184a18a.patch application/octet-stream 37.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-06-16 10:31:37 Re: Support logical replication of DDLs
Previous Message Nikita Malakhov 2023-06-16 10:18:49 Re: Pluggable toaster