Re: Do we want a hashset type?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-08 10:19:07
Message-ID: 86f8a027-e6d4-65c0-7349-faa3301c8b8b@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/8/23 11:41, Joel Jacobson wrote:
> On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
>> Interesting, considering how dumb the the hash table implementation is.
>
> That's promising.
>

Yeah, not bad for sleep-deprived on-plane hacking ...

There's a bunch of stuff that needs to be improved to make this properly
usable, like:

1) better hash table implementation

2) input/output functions

3) support for other types (now it only works with int32)

4) I wonder if this might be done as an array-like polymorphic type.

5) more efficient storage format, with versioning etc.

6) regression tests

Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.

>>> I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
>>> However, I've probably misunderstood something, maybe I need to add some index or something.
>>> Even so, it's interesting it's apparently not fast "by default".
>>>
>>
>> No idea how to fix that, but it's rather suspicious.
>
> I've had a graph-db expert review my benchmark, and he suggested adding an index:
>
> CREATE INDEX FOR (n:User) ON (n.id)
>
> This did improve the execution time for Neo4j a bit, down from 819 ms to 528 ms, but PostgreSQL 299 ms is still a win.
>
> Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
> Note, in this benchmark, I only test the naive RECURSIVE CTE approach using array_agg(DISTINCT ...).
> And I couldn't even test the most connected user with Neo4j, the query never finish for some reason,
> so I had to test with a less connected user.
>

Interesting. I'd have expected the graph db to be much faster.

> The graph expert also said that other more realistic graph use-cases might be "multi-relational",
> and pointed me to a link: https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
> No idea how such multi-relational datasets would affect the benchmarks.
>

Not sure either, but I don't have ambition to improve everything at
once. If the hashset improves one practical use case, fine with me.

> I think we have already strong indicators that PostgreSQL with a hashset type will from a relative
> performance perspective, do just fine processing basic graph queries, even with large datasets.
>
> Then there will always be the case when users primarily write very different graph queries all day long,
> who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or Gremlin.
>

Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-06-08 10:29:38 Re: running logical replication as the subscription owner
Previous Message Hannu Krosing 2023-06-08 10:15:58 Re: Let's make PostgreSQL multi-threaded