Re: Do we want a hashset type?

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Tom Dunstan <pgsql(at)tomd(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-19 00:00:00
Message-ID: CACJufxF_Eo-=1_ffOGaQQzubKEGDgko1tD55VQWo-9=-xAntEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jun 17, 2023 at 8:38 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> > I realise int4hashset_hash() is broken,
> > since two int4hashset's that are considered equal,
> > can by coincidence get different hashes:
> ...
> > Do we have any ideas on how to fix this without sacrificing performance?
>
> The problem was due to hashset_hash() function accumulating the hashes
> of individual elements in a non-commutative manner. As a consequence, the
> final hash value was sensitive to the order in which elements were inserted
> into the hashset. This behavior led to inconsistencies, as logically
> equivalent sets (i.e., sets with the same elements but in different orders)
> produced different hash values.
>
> Solved by modifying the hashset_hash() function to use a commutative operation
> when combining the hashes of individual elements. This change ensures that the
> final hash value is independent of the element insertion order, and logically
> equivalent sets produce the same hash.
>
> An somewhat unfortunate side-effect of this fix, is that we can no longer
> visually sort the hashset output format, since it's not lexicographically sorted.
> I think this is an acceptable trade-off for a hashset type,
> since the only alternative I see would be to sort the elements,
> but then it wouldn't be a hashset, but a treeset, which different
> Big-O complexity.
>
> New patch is attached, which will henceforth always be a complete patch,
> to avoid the hassle of having to assemble incremental patches.
>
> /Joel

select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
---------------------------------------------------------------------------------
SELECT attname
,pc.relname
,CASE attstorage
WHEN 'p' THEN 'plain'
WHEN 'e' THEN 'external'
WHEN 'm' THEN 'main'
WHEN 'x' THEN 'extended'
END AS storage
FROM pg_attribute pa
join pg_class pc on pc.oid = pa.attrelid
where attnum > 0 and pa.attstorage = 'e';

In my system catalog, it seems only the hashset type storage =
'external'. most is extended.....
I am not sure the consequence of switch from external to extended.
------------------------------------------------------------------------------------------------------------
select hashset_hash('{-1,1}') as a1
,hashset_hash('{1,-2}') as a2
,hashset_hash('{-3,1}') as a3
,hashset_hash('{4,1}') as a4;
returns:
a1 | a2 | a3 | a4
-------------+-----------+------------+------------
-1735582196 | 998516167 | 1337000903 | 1305426029
(1 row)

values {a1,a2,a3,a4} should be monotone increasing, based on the
function int4hashset_cmp, but now it's not.
so the following queries failed.

--should return only one row.
select hashset_cmp('{2,1}','{3,1}')
union
select hashset_cmp('{3,1}','{4,1}')
union
select hashset_cmp('{1,3}','{4,1}');

select hashset_cmp('{9,10,11}','{10,9,-11}') =
hashset_cmp('{9,10,11}','{10,9,-1}'); --should be true
select '{2,1}'::int4hashset > '{7}'::int4hashset; --should be false.
based on int array comparison,.
-----------------------------------------------------------------------------------------
I comment out following lines in hashset-api.c somewhere between {810,829}

// if (a->hash < b->hash)
// PG_RETURN_INT32(-1);
// else if (a->hash > b->hash)
// PG_RETURN_INT32(1);

// if (a->nelements < b->nelements)
// PG_RETURN_INT32(-1);
// else if (a->nelements > b->nelements)
// PG_RETURN_INT32(1);

// Assert(a->nelements == b->nelements);

So hashset_cmp will directly compare int array. the above queries works.

{int4hashset_equals,int4hashset_neq} two special cases of hashset_cmp.
maybe we can just wrap it just like int4hashset_le?

now store 10 element int4hashset need 99 bytes, similar one dimension
bigint array with length 10, occupy 101 byte....

in int4hashset_send, newly add struct attributes/member {load_factor
growth_factor ncollisions hash} also need send to buf?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-06-19 01:10:59 Re: [PATCH] hstore: Fix parsing on Mac OS X: isspace() is locale specific
Previous Message Michael Paquier 2023-06-18 23:57:17 Re: Deleting prepared statements from libpq.