Re: Adding basic NUMA awareness

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-06-25 12:19:36
Message-ID: CAKZiRmwsjwDatuV5JfUaAbkTxjnUaL_UK3SkFTJee-jo-u_N7Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 24, 2026 at 10:29 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:

Hi,

> Here's an updated patch series, with only minor changes to fix the mbind
> issues:
[..]
> I've also included Jakub's "goodies" patch with the additional GUCs.
> Those seem potentially useful to development.

Cool!

> I have some results from a new round of benchmarks, and it's a bit
> disappointing. Or rather, there seem to be some issues that I can't
> figure out, causing regressions.
[..]
> This chart is for median latency (in milliseconds):
>
> clients   master     0003     0004  0003/on  0004/on
> ----------------------------------------------------
>       1    12767    12582    14509    12807    15307
>       8    14383    14355    14149    14069    16165
>      32    14756    15198    14836    14984    17128
> ----------------------------------------------------
>       1              103%     114%     100%     120%
>       8              101%      98%      98%     112%
>      32              102%     101%     102%     116%
>

I haven't tried it yet, but I can spot a few things:

I have no crystal-clear idea why, but in the script I can see that you have
io_method = io_uring and are not dropping caches, so IMHO it is too complex
an interaction at this stage.

One hint: such a setup is going to be problematic for proving numbers.
At the meeting I tried to describe that I've been using io_method = sync
instead of 'worker' to get more predictable results (together with
echo 3 > drop_caches), because then it is that backend's CPU/$NODE doing the
pread()/pwrite(). Whoever performs the load (the backend or any other
operation) is going to put that file onto that_specific_$NODE, so even if
you have a sequence like:
pgbench -i
pg_ctl restart
pgbench -c XX

then pgbench -i, even with shared_buffers_numa=on, will spread the Buffers
across many nodes, yet after the restart the VFS cache portion of the data
will still reside on the single specific $NODE that wrote it to the
filesystem (due to local first-touch affinity, which applies even to the VFS
cache), so any further read VFS cache --pread()--> s_b will take the hit of
the remote interconnect with some probability, depending on where the new
backends are running. Also, with worker it is even worse, as we have those
memory queues in between. I think we can even have this:

file in VFS cache @ node0 --because of first touch policy (pgbench -i/prewarm)
io worker @ node1 --hits latency from node0 and node2
shm io worker queue @ node2 --well
client backend @ node3 --puts into shm io worker on node2

Therefore I'm sticking to 'sync' to ease the pain... but with io_uring I
suspect the situation is kind of similar, as we call io_uring_submit() and
may end up using io-wq kernel threads, and we have those
submission/completion (memory) queues that are located somewhere (that is,
on some node) too.

I think we simply lack IO/NUMA affinity for all io modes except sync, but I
suspect it's too early for that and way outside the scope of this $thread.
I started thinking about it just last week, so... (but hopefully next week
I'll be able to ship a helper, fscachenuma.c, to show the layout of a file
across the VFS caches on nodes)

Maybe some other suggestions:

Q1) Maybe some crosschecks first?
# balance should be equal between nodes even for baseline
# the Linux kernel tends to place the whole shm on one node if it fits
find /sys/devices/system/node*/ -name 'free_hugepages' -exec grep -H . {} \;

# check N0 and N1 even for default policy, might also reveal imbalance
# lots of RAM and too big huge_pages allows fitting whole shm into just N0
# see point 4 from [1]
grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps

# then during pgbench -c run maybe those:
mpstat -N ALL 1
perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
    --per-socket -I 1000
# or -M memory_bandwidth_read,memory_bandwidth_write

(it might reveal that problem I've described above about io_method:
even with pgbench -c 1 you might be reading from all sockets/wrong sockets
instead of the correct one with affinity)
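
A minimal sketch of how the numa_maps cross-check above could be turned into
per-node totals. The function name pages_per_node and its needle parameter
are made up for illustration; the only thing taken from the kernel is the
N<node>=<pages> field format of /proc/<pid>/numa_maps. The counts are in
units of the mapping's own page size (kernelpagesize_kB on the same line).

```python
import re
from collections import Counter

# Per-node page counts in a numa_maps line look like "N0=512 N1=3".
_NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text, needle="/anon_h"):
    """Sum pages per NUMA node over the numa_maps lines containing `needle`.

    `needle` plays the role of the grep pattern: "/anon_h" selects the
    anonymous huge page mappings and skips everything else.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        if needle not in line:
            continue
        for node, pages in _NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)


# Example use against a live backend (pid is a placeholder):
#   with open(f"/proc/{pid}/numa_maps") as f:
#       print(pages_per_node(f.read()))
```

A result heavily skewed towards one node would point at the imbalance
described above, before any benchmark is run.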

I like to pin CPUs to just one node for
pgbench -c <NUMBER_OF_CPUS/NUMBER_OF_NODES> [to saturate one node only] and
also start the server with CPU pinning [or use this debug_numa_node to force
it] to that one node, and cross-check what's being read (using perf).
Usually I also have to disarm clock balancing and override weights using
pg_buffercache_set_partition() to force the weight to stay local only - only
then am I able to outrun master. That's how the idea was born that if we are
only working on node $N with some relations, then let's use only node $N's
Buffers. But I have 90us:~280us local vs remote latency, so it's probably
way easier for me to see results even without disabling
CPU-idle-states/turboboost/etc.
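
As an illustration of the CPU pinning step only, here is a rough sketch that
reads one node's CPU list from sysfs and restricts a process to it. The
helper names parse_cpulist and pin_to_node are hypothetical; the sketch
assumes Linux and the usual /sys/devices/system/node/node<N>/cpulist file,
and it sets CPU affinity only, not the memory policy.

```python
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single CPU ("5") has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def pin_to_node(node, pid=0):
    """Restrict `pid` (0 = the calling process) to the CPUs of one NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

Since affinity is inherited by child processes, calling pin_to_node(0) in a
wrapper before launching pgbench or the server would keep them on node 0's
CPUs.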

Q2) Dunno, but 0007 is not changing anything at runtime, and yet you get a
huge discrepancy in results when going 0006 -> 0007 for clients=1 (see
128% -> 112%). It is literally the same code, so would a different rebuild
(ELF image) have a layout different enough to cause perf issues?

Hopefully next week I'll try to repro those numbers to see if I can
help more.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzo9xnJSgO4b26DTZqPuObcQ-6ncay%2BmOEKs9rzCkegUA%40mail.gmail.com
