Re: Do we want a hashset type?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, jian he <jian(dot)universality(at)gmail(dot)com>
Cc: Tom Dunstan <pgsql(at)tomd(dot)cc>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-19 12:59:01
Message-ID: 24e60a0e-b067-7f74-9f48-09f7d056144e@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/19/23 13:33, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
>> AFAICS the standard only defines arrays and multisets. Arrays are pretty
>> much the thing we have, including the ARRAY[] constructor etc. Multisets
>> are similar to hashset discussed here, except that it tracks the number
>> of elements for each value (which would be trivial in hashset).
>>
>> So if we want to make this a built-in feature, maybe we should aim to do
>> the multiset thing, with the standard SQL syntax? Extending the grammar
>> should not be hard, I think. I'm not sure of the underlying code
>> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
>> lot of separate code doing that.
>
> Multisets handle duplicates uniquely, this may bring unexpected issues. Sets
> and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
> more fundamental and prevalent in std libs than multisets.
>

What unexpected issues you mean? Sure, if someone uses multisets as if
they were sets (so ignoring the handling of duplicates), things will go
booom! quickly.

I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
operators) that'd allow perhaps help with this.

> Despite SQL's multiset possibility, a distinct hashset type is my preference,
> helping appropriate data structure choice and reducing misuse.
>
> The necessity of multisets is vague beyond standards compliance.
>

True - we haven't had any requests/proposal to implement MULTISETs.

I've looked at the SQL standard primarily to check if maybe there's some
precedent that'd give us guidance on the SQL syntax etc. And I think
multisets are that - even if we end up not implementing them, it'd be
sad to have unnecessarily inconsistent syntax (in case someone decides
to add multisets in the future).

We could invent "SET" data type, so while standard has ARRAY / MULTISET,
we'd have ARRAY / MULTISET / SET, and the difference between the last
two would be just handling of duplicates.

The other way to look at sets is that they are pretty similar to arrays,
except that there are no duplicates and order does not matter. Sure, the
on-disk format and code is different, but from the SQL perspective it'd
be nice to allow using sets in most places where arrays are allowed
(which is what the standard does for MULTISETS, more or less).

That'd mean we could probably search through gram.y for places working
with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
make them work with sets too, say by having SET_SUBLINK instead of
ARRAY_SUBLINK, set_expression instead of array_expression, etc.

This might be also "consistent" with defining hashset type using CREATE
TYPE with ELEMENT, because we consider the type to be "array". So that
would be polymorphic type, but we don't have pre-defined array for every
type (and I'm not sure we want to).

Of course, maybe there's some fatal flaw in these idea, I don't know.
And I don't want to move the goalposts too far - but it seems like this
might make some stuff actually simpler to implement (by piggy-backing on
the existing array infrastructure).

A mostly unrelated thought - I wonder if this might be somehow related
to the foreign key array patch ([1] might be the most recent attempt in
this direction). Not to hashset itself, but I recalled these patches
because it'd mean we don't need the separate "edges" link table (so the
hashset column would be the think backing the FK).

[1]
https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2023-06-19 13:31:14 Re: psql: Add role's membership options to the \du+ command
Previous Message Jelte Fennema 2023-06-19 12:49:44 Re: Deleting prepared statements from libpq.