Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "jian he" <jian(dot)universality(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-11 10:26:39
Message-ID: 2040c023-1a52-4366-9716-8c8507bb6e32@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
> On 6/10/23 17:46, Andrew Dunstan wrote:
>>
>> Maybe you can post a full patch as well as incremental?
>>
>
> I wonder if we should keep discussing this extension here, considering
> it's going to be out of core (at least for now). Not sure how many
> pgsql-hackers are interested in this, so maybe we should just move it to
> github PRs or something ...

I think there are some good arguments that speaks in favour of including it in core:

1. It's a fundamental data structure. Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is implementation
details, but we cannot use that word since it's a reserved keyword, hence "hashset".

2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
among SQL users to evaluate graph queries. Even if a future implementation of SQL/PGQ
would mean users wouldn't need to deal with the hashset type directly, the same
type could hopefully be used, in part or in whole, under the hood by the future
SQL/PGQ implementation. If low-level functionality is useful on its own, I think it's
a benefit of exposing it to users, since it allows system testing of the component
in isolation, even if it's primarily gonna be used as a smaller part of a larger more
high-level component.

3. I think there is a general need for hashset, experienced by myself, Andrew and
I would guess lots of others users. The general pattern that will be improved is
when you currently would do array_agg(DISTINCT ...)
probably there are other situations too, since it's a fundamental data structure.

On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 3) support for other types (now it only works with int32)
> I think we should decide what types we want/need to support, and add one
> or two types early. Otherwise we'll have code / on-disk format making
> various assumptions about the type length etc.
>
> I have no idea what types people use as node IDs - is it likely we'll
> need to support types passed by reference / varlena types? Or can we
> just assume it's int/bigint?

I think we should just support data types that would be sensible
to use as a PRIMARY KEY in a fully normalised data model,
which I believe would only include "int", "bigint" and "uuid".

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2023-06-11 12:41:27 Should heapam_estimate_rel_size consider fillfactor?
Previous Message vignesh C 2023-06-11 03:12:04 Re: Implement generalized sub routine find_in_log for tap test