Re: Do we want a hashset type?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Do we want a hashset type?
Date: 2023-06-01 07:02:21
Message-ID: f4ad4ec9-160f-48a6-851d-b2b34db49ebc@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 31, 2023, at 18:59, Tomas Vondra wrote:
> How does storing just the IDs solves anything? Isn't the main challenge
> the random I/O when fetching the adjacent nodes? This does not really
> improve that, no?

I'm thinking of a recursive query where a lot of time is just spent following
all friends-of-friends, where the graph traversal is the heavy part,
where the final set of nodes are only fetched at the end.

> It's entirely possible I'm missing some important aspect. It'd be very
> useful to have a couple example queries illustrating the issue, that we
> could use to actually test different approaches.

Here is an example using a real anonymised social network.

wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
gunzip soc-pokec-relationships.txt.gz
CREATE TABLE edges (from_node INT, to_node INT);
\copy edges from soc-pokec-relationships.txt;
ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);
SET work_mem TO '1GB';
CREATE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
SELECT
ARRAY[5867::bigint] AS current,
ARRAY[5867::bigint] AS found,
0 AS depth
UNION ALL
SELECT
new_current,
found || new_current,
friends_of_friends.depth + 1
FROM
friends_of_friends
CROSS JOIN LATERAL (
SELECT
array_agg(DISTINCT edges.to_node) AS new_current
FROM
edges
WHERE
from_node = ANY(friends_of_friends.current)
) q
WHERE
friends_of_friends.depth < 3
)
SELECT
depth,
coalesce(array_length(current, 1), 0)
FROM
friends_of_friends
WHERE
depth = 3;
;

SELECT COUNT(*) FROM edges;
count
----------
30622564
(1 row)

SELECT COUNT(DISTINCT from_node) FROM edges;
count
---------
1432693
(1 row)

-- Most connected user (worst-case) is user 5867 with 8763 friends:
SELECT from_node, COUNT(*) FROM edges GROUP BY from_node ORDER BY COUNT(*) DESC LIMIT 1;
from_node | count
-----------+-------
5867 | 8763
(1 row)

-- Friend-of-friends query exactly at depth three:

EXPLAIN ANALYZE
SELECT * FROM friends_of_friends;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=6017516.90..6017517.60 rows=1 width=8) (actual time=2585.881..2586.334 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..6017516.90 rows=31 width=68) (actual time=0.005..2581.664 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=68) (actual time=0.002..0.002 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=200583.71..601751.66 rows=3 width=68) (actual time=645.036..645.157 rows=1 loops=4)
-> Nested Loop (cost=200583.71..601751.54 rows=3 width=68) (actual time=641.880..641.972 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=68) (actual time=0.001..0.002 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=200583.71..200583.72 rows=1 width=32) (actual time=850.997..850.998 rows=1 loops=3)
-> Bitmap Heap Scan on edges (cost=27656.38..196840.88 rows=1497133 width=4) (actual time=203.239..423.534 rows=3486910 loops=3)
Recheck Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Blocks: exact=117876
-> Bitmap Index Scan on edges_pkey (cost=0.00..27282.10 rows=1497133 width=0) (actual time=198.047..198.047 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Planning Time: 1.414 ms
Execution Time: 2588.288 ms
(19 rows)

I tested on PostgreSQL 15.2. For some reason I got a different slower query on HEAD:

SELECT * FROM friends_of_friends;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on friends_of_friends (cost=6576.67..6577.37 rows=1 width=8) (actual time=6412.693..6413.335 rows=1 loops=1)
Filter: (depth = 3)
Rows Removed by Filter: 3
CTE friends_of_friends
-> Recursive Union (cost=0.00..6576.67 rows=31 width=68) (actual time=0.008..6407.134 rows=4 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=68) (actual time=0.005..0.005 rows=1 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=219.05..657.64 rows=3 width=68) (actual time=1600.747..1600.934 rows=1 loops=4)
-> Nested Loop (cost=219.05..657.52 rows=3 width=68) (actual time=1594.906..1595.035 rows=1 loops=4)
-> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=68) (actual time=0.001..0.002 rows=1 loops=4)
Filter: (depth < 3)
Rows Removed by Filter: 0
-> Aggregate (cost=219.05..219.06 rows=1 width=32) (actual time=2118.105..2118.105 rows=1 loops=3)
-> Sort (cost=207.94..213.49 rows=2221 width=4) (actual time=1780.770..1925.853 rows=3486910 loops=3)
Sort Key: edges.to_node
Sort Method: quicksort Memory: 393217kB
-> Index Only Scan using edges_pkey on edges (cost=0.56..84.48 rows=2221 width=4) (actual time=0.077..762.408 rows=3486910 loops=3)
Index Cond: (from_node = ANY (friends_of_friends_1.current))
Heap Fetches: 0
Planning Time: 8.229 ms
Execution Time: 6446.421 ms
(20 rows)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2023-06-01 07:14:08 Re: Do we want a hashset type?
Previous Message Richard Guo 2023-06-01 02:30:16 Re: An inefficient query caused by unnecessary PlaceHolderVar