Re: Do we want a hashset type?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-05-31 14:53:23
Message-ID: 501f66fc-c113-a64c-3b98-7e1844403c65@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/31/23 16:09, Joel Jacobson wrote:
> Hi,
>
> I've been working with a social network start-up that uses PostgreSQL as
> their
> only database. Recently, they became interested in graph databases, largely
> because of an article [1] suggesting that a SQL database "just chokes"
> when it
> encounters a depth-five friends-of-friends query (for a million users having
> 50 friends each).
>
> The article didn't provide the complete SQL queries, so I had to buy the
> referenced book to get the full picture. It turns out, the query was a
> simple
> self-join, which, of course, isn't very efficient. When we rewrote the query
> using a modern RECURSIVE CTE, it worked but still took quite some time.
>
> Of course, there will always be a need for specific databases, and some
> queries
> will run faster on them. But, if PostgreSQL could handle graph queries
> with a
> Big-O runtime similar to graph databases, scalability wouldn't be such a big
> worry.
>
> Just like the addition of the JSON type to PostgreSQL helped reduce the hype
> around NoSQL, maybe there's something simple that's missing in
> PostgreSQL that
> could help us achieve the same Big-O class performance as graph
> databases for
> some of these type of graph queries?
>
> Looking into the key differences between PostgreSQL and graph databases,
> it seems that one is how they store adjacent nodes. In SQL, a graph can be
> represented as one table for the Nodes and another table for the Edges.
> For a friends-of-friends query, we would need to query Edges to find
> adjacent
> nodes, recursively.
>
> Graph databases, on the other hand, keep adjacent nodes immediately
> accessible
> by storing them with the node itself. This looks like a major difference in
> terms of how the data is stored.
>
> Could a hashset type help bridge this gap?
>
> The idea would be to store adjacent nodes as a hashset column in a Nodes
> table.
>

I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?

Presumably it'd store whole adjacent nodes, not just some sort of node
ID. So what if a node is adjacent to many other nodes? What if a node is
added/deleted/modified?

AFAICS the main problem is the lookups of adjacent nodes, generating
lot of random I/O etc. Presumably it's not that hard to keep the
"relational" schema with table for vertices/edges, and then an auxiliary
table with adjacent nodes grouped by node, possibly maintained by a
couple triggers. A bit like an "aggregated index" except the queries
would have to use it explicitly.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yugo NAGATA 2023-05-31 15:14:26 Incremental View Maintenance, take 2
Previous Message Daniel Gustafsson 2023-05-31 14:12:42 Re: Refactor ssl tests to avoid using internal PostgreSQL::Test::Cluster methods