Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "jian he" <jian(dot)universality(at)gmail(dot)com>
Cc: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, "Tom Dunstan" <pgsql(at)tomd(dot)cc>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-28 08:50:20
Message-ID: 73e4c933-320b-4dce-ae38-c0873d018335@app.fastmail.com
Lists: pgsql-hackers

On Wed, Jun 28, 2023, at 08:26, jian he wrote:

> Hi there.
> I changed the function hashset_contains to strict.

Changing hashset_contains to STRICT would cause it to return NULL
whenever either operand is NULL, which I don't believe is correct, since:

SELECT NULL = ANY('{}'::int4[]);
 ?column?
----------
 f
(1 row)

Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.
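
To spell out the intended non-STRICT behaviour, here is a sketch reusing the
literals above (my reading, consistent with the expected regression output
quoted below):

-- An empty set cannot contain anything, so the answer is known: false.
SELECT hashset_contains('{}'::int4hashset, NULL);   -- f, like NULL = ANY('{}'::int4[])
-- If the set itself is NULL, the answer is unknown.
SELECT hashset_contains(NULL::int4hashset, NULL);   -- NULL, like NULL = ANY(NULL::int4[])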

Did you try running `make installcheck` after your change?
You would then have seen one of the tests failing:

test array-and-multiset-semantics ... FAILED 21 ms

Check the content of `regression.diffs` to see why:

% cat regression.diffs
diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out
--- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38
+++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27
@@ -158,7 +158,7 @@
| | {NULL} | {NULL} | |
| 1 | {1} | {1} | |
| 4 | {4} | {4} | |
- {} | | {NULL} | {NULL} | f | f
+ {} | | {NULL} | {NULL} | | f
{} | 1 | {1} | {1} | f | f
{} | 4 | {4} | {4} | f | f
{NULL} | | {NULL} | {NULL,NULL} | |
@@ -284,7 +284,8 @@
"= ANY(...)";
arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
------+------+-------------+--------------+------------------+------------
-(0 rows)
+ {} | | {NULL} | {NULL} | | f
+(1 row)
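
The test compares the hashset operations against their array counterparts.
A rough guess at the shape of the comparison, just to illustrate what is being
checked (the actual test file may differ):

SELECT h AS arg1, i AS arg2,
       hashset_add(h, i) AS hashset_add,
       array_append(a, i) AS array_append,
       hashset_contains(h, i) AS hashset_contains,
       i = ANY(a) AS "= ANY(...)"
FROM (VALUES (NULL::int4hashset, NULL::int4[]),
             ('{}'::int4hashset, '{}'::int4[])) AS s(h, a),
     (VALUES (NULL::int), (1), (4)) AS v(i);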

> also change the way to return an empty array.

Nice.
I agree the `Datum d` variable was unnecessary.
I also removed the unused includes.

> in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> int4hashset can speed distinct aggregate and distinct counts?
> like the following:
>
> explain(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
>
> explain(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3

The 100k tables seem too small to give meaningful results
when trying to measure individual queries:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
Execution Time: 26.790 ms
Execution Time: 30.616 ms
Execution Time: 33.253 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k;
Execution Time: 32.797 ms
Execution Time: 27.605 ms
Execution Time: 26.228 ms

If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
is actually faster for the `i` column where all input integers are unique:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
Execution Time: 799.017 ms
Execution Time: 796.008 ms
Execution Time: 799.121 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_10M;
Execution Time: 1204.873 ms
Execution Time: 1221.822 ms
Execution Time: 1216.340 ms

For random integers, hashset is a win though:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
Execution Time: 1874.722 ms
Execution Time: 1878.760 ms
Execution Time: 1861.640 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(rnd) FROM benchmark_input_10M;
Execution Time: 1253.709 ms
Execution Time: 1222.651 ms
Execution Time: 1237.849 ms

> explain(costs off,timing off, analyze,buffers)
> select count(distinct rnd) from benchmark_input_100k \watch c=3
>
> explain(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> benchmark_input_100k) sub(x) \watch c=3

I tried these with 10M:

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
Execution Time: 1733.320 ms
Execution Time: 1725.214 ms
Execution Time: 1716.636 ms

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x);
Execution Time: 1249.612 ms
Execution Time: 1240.558 ms
Execution Time: 1252.103 ms
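
For reference, the queries above assume input tables with a unique column i
and a column rnd of random integers; a hedged guess at how they could be
generated (the actual benchmark.sql may define them differently):

CREATE TABLE benchmark_input_100k AS
SELECT i,
       (random() * 100000)::int AS rnd    -- duplicates expected
FROM generate_series(1, 100000) AS i;

CREATE TABLE benchmark_input_10M AS
SELECT i,
       (random() * 10000000)::int AS rnd
FROM generate_series(1, 10000000) AS i;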

Not sure what I think of the current benchmark suite.

I think it would be better to include only some realistic examples from
real life, such as the graph query that got me started on this work in the
first place (a sketch follows below). Otherwise there is a risk that we
optimise for some hypothetical scenario that is not relevant in practice.
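
The kind of graph workload in question looks roughly like this
friends-of-friends query (a sketch against a hypothetical edges(src, dst)
table, not the exact query):

-- Collect the distinct set of nodes reachable within two hops of node 1,
-- deduplicating on the fly instead of via array_agg(DISTINCT ...).
SELECT hashset_agg(e2.dst) AS reachable
FROM edges AS e1
JOIN edges AS e2 ON e2.src = e1.dst
WHERE e1.src = 1;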

It would be good to have more examples of typical workloads for which the
hashset type would be useful.

/Joel
