Re: Do we want a hashset type?

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Joel Jacobson <joel(at)compiler(dot)org>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-23 06:40:34
Message-ID: CACJufxFXSpL67dL+bgvB4DiJo7ovp8bxvtW00L1hV7qDzUb_6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I played around array_func.c
many of the code can be used for multiset data type.
now I imagine multiset as something like one dimension array. (nested is
somehow beyond the imagination...).

* A standard varlena array has the following internal structure:
* <vl_len_> - standard varlena header word
* <ndim> - number of dimensions of the array
* <dataoffset> - offset to stored data, or 0 if no nulls bitmap
* <elemtype> - element type OID
* <dimensions> - length of each array axis (C array of int)
* <lower bnds> - lower boundary of each dimension (C array of int)
* <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
* <actual data> - whatever is the stored data

in set/multiset, we don't need {ndim,lower bnds}, since we are only one
dimension, also we don't need subscript.
So for set we can have following
* int32 vl_len_; /* varlena header (do not touch directly!) */
* int32 capacity; /* # of capacity */
* int32 dataoffset; /* offset to data, or 0 if no bitmap */
* int32 nelements; /* number of items added to the hashset */
* Oid elemtype; /* element type OID */
* <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
* <bitmap> - bitmap showing this slot is empty or not ( I am not sure
this part)
* <actual data> - whatever is the stored data

many of the code in array_func.c can be reused.
array_isspace ==> set_isspace
ArrayMetaState ==> SetMetastate
ArrayCount ==> SetCount (similar to ArrayCount return the dimension
of set, should be zero (empty set) or one)
ArrayParseState ==> SetParseState
ReadArrayStr ==> ReadSetStr

attached is a demo shows that use array_func.c to parse cstring. have
similar effect of array_in.
for multiset_in set type input function. if no duplicate required then
multiset_in would just like array, so more code can be copied from
array_func.c
but if unique required then we need first palloc0(capacity * datums (size
per type)) then put valid value into to a specific slot?

On Fri, Jun 23, 2023 at 6:27 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> On 6/22/23 19:52, Joel Jacobson wrote:
> > On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> >> This is also what the SQL standard does for multisets - there's SQL:20nn
> >> draft at http://www.wiscorp.com/SQLStandards.html, and the <member
> >> predicate> section (p. 475) explains how this should work with NULL.
> >
> > I've looked again at the paper you mentioned and found something
> intriguing
> > in section 2.6 (b). I'm a bit puzzled about this: why would we want to
> return
> > null when we're certain it's not null but just doesn't have any elements?
> >
> > In the same vein, it says, "If it has more than one element, an
> exception is
> > raised." Makes sense to me, but what about when there are no elements at
> all?
> > Why not raise an exception in that case too?
> >
> > The ELEMENT function is designed to do one simple thing: return the
> element of
> > a multiset if the multiset has only 1 element. This seems very similar
> to how
> > our INTO STRICT operates, right?
> >
>
> I agree this looks a bit weird, but that's what I mentioned - this is an
> initial a proposal, outlining the idea. Inevitably some of the stuff
> will get reworked or just left out of the final version. It's useful
> mostly to explain the motivation / goal.
>
> I believe that's the case here - I don't think the ELEMENT got into the
> standard at all, and the NULL rules for the MEMBER OF clause seem not to
> have these strange bits.
>
> > The SQL:20nn seems to still be in draft form, and I can't help but
> wonder if we
> > should propose a bit of an improvement here:
> >
> > "If it doesn't have exactly one element, an exception is raised."
> >
> > Meaning, it would raise an exception both if there are more elements,
> > or zero elements (no elements).
> >
> > I think this would make the semantics more intuitive and less surprising.
> >
>
> Well, the simple truth is the draft is freely available, but you'd need
> to buy the final version. It doesn't mean it's still being worked on or
> that no SQL standard was released since then. In fact, SQL 2023 was
> released a couple weeks ago [1].
>
> It'd be interesting to know the version that actually got into the SQL
> standard (if at all), but I don't have access to the standard yet.
>
> regards
>
>
> [1] https://www.iso.org/standard/76584.html
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

--
I recommend David Deutsch's <<The Beginning of Infinity>>

Jian

Attachment Content-Type Size
set.c text/x-csrc 25.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2023-06-23 07:27:33 Re: Infinite Interval
Previous Message Alexander Pyhalov 2023-06-23 06:03:24 Re: memory leak in trigger handling (since PG12)