Re: Prefetch the next tuple's memory during seqscans

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Prefetch the next tuple's memory during seqscans
Date: 2022-11-02 17:25:44
Lists: pgsql-hackers


On 2022-11-01 20:00:43 -0700, Andres Freund wrote:
> I suspect that prefetching in heapgetpage() would provide gains as well, at
> least for pages that aren't marked all-visible, pretty common in the real
> world IME.

Attached is an experimental patch/hack for that. Improving the access order
ended up being more beneficial than prefetching the tuple contents, but I'm
not at all sure that's the be-all and end-all.

I separately benchmarked pinning the CPU and memory to the same socket, to
different sockets, and with interleaved memory.

I did this for HEAD, your patch, your patch plus mine, and mine alone.

BEGIN;
DROP TABLE IF EXISTS large;
CREATE TABLE large(a int8 not null, b int8 not null default '0', c int8);
INSERT INTO large SELECT generate_series(1, 50000000);
COMMIT;

server is started with
local: numactl --membind 1 --physcpubind 10
remote: numactl --membind 0 --physcpubind 10
interleave: numactl --interleave=all --physcpubind 10

benchmark started with:
psql -qX -f ~/tmp/prewarm.sql && \
pgbench -n -f ~/tmp/seqbench.sql -t 1 -r > /dev/null && \
perf stat -e task-clock,LLC-loads,LLC-load-misses,cycles,instructions -C 10 \
pgbench -n -f ~/tmp/seqbench.sql -t 3 -r

SELECT sum(a), sum(b), sum(c) FROM large;
SELECT sum(c) FROM large;

branch          memory       time s   miss %
head            local        31.612   74.03
david           local        32.034   73.54
david+andres    local        31.644   42.80
andres          local        30.863   48.05

head            remote       33.350   72.12
david           remote       33.425   71.30
david+andres    remote       32.428   49.57
andres          remote       30.907   44.33

head            interleave   32.465   71.33
david           interleave   33.176   72.60
david+andres    interleave   32.590   46.23
andres          interleave   30.440   45.13

It's cool to see that optimizing heapgetpage() seems to pretty much remove
the performance difference between local and remote memory.
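
To put a number on that, here is a small sketch (Python, not part of the
patch) that computes how much slower each branch is with remote memory than
with local memory. The timings are copied from the table above; the helper
name remote_penalty_pct is made up for illustration.

```python
# Timings in seconds, copied from the table above; keys are (branch, memory).
TIMES = {
    ("head", "local"): 31.612,
    ("head", "remote"): 33.350,
    ("david", "local"): 32.034,
    ("david", "remote"): 33.425,
    ("david+andres", "local"): 31.644,
    ("david+andres", "remote"): 32.428,
    ("andres", "local"): 30.863,
    ("andres", "remote"): 30.907,
}


def remote_penalty_pct(branch):
    """Extra time of remote over local memory, as a percentage of local time."""
    local = TIMES[(branch, "local")]
    remote = TIMES[(branch, "remote")]
    return (remote - local) / local * 100.0


if __name__ == "__main__":
    for branch in ("head", "david", "david+andres", "andres"):
        print(f"{branch:<13} {remote_penalty_pct(branch):5.2f}%")
```

With these numbers the remote penalty is roughly 5.5% on head and about 0.1%
with only the heapgetpage change applied.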

It makes some sense that David's patch doesn't help in this case - without
all-visible being set the tuple headers will have already been pulled in for
the HTSV call.

I've not yet experimented with moving the prefetch for the tuple contents from
David's location to before the HTSV. I suspect that might benefit both


Andres Freund

