Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-06-25 13:49:00
Message-ID: 2b6abd8f-3659-4be1-9499-b4f00f973f22@vondra.me
Lists: pgsql-hackers

On 6/25/26 14:19, Jakub Wartak wrote:
> On Wed, Jun 24, 2026 at 10:29 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hi,
>
>> Here's an updated patch series, with only minor changes to fix the mbind
>> issues:
> [..]
>> I've also included Jakub's "goodies" patch with the additional GUCs.
>> Those seem potentially useful to development.
>
> Cool!
>
>> I have some results from a new round of benchmarks, and it's a bit
>> disappointing. Or rather, there seem to be some issues that I can't
>> figure out, causing regressions.
> [..]
>> This chart is for median latency (in milliseconds):
>>
>> clients   master     0003     0004  0003/on  0004/on
>> ----------------------------------------------------
>>       1    12767    12582    14509    12807    15307
>>       8    14383    14355    14149    14069    16165
>>      32    14756    15198    14836    14984    17128
>> ----------------------------------------------------
>>       1              103%     114%     100%     120%
>>       8              101%      98%      98%     112%
>>      32              102%     101%     102%     116%
>>
>
> I haven't tried it yet, however I can spot some things:
>
> No crystal clear idea why, but in the script I can see that you have
> io_method = io_uring and are not dropping caches, so IMHO it is too complex
> an interaction at this stage.
>

By caches I assume you mean page cache? The test is meant to simulate a
cached system, copying data between shared buffers and page cache. My
expectation is that once we start hitting I/O, it'll completely hide
most differences due to NUMA.

> One hint: such a setup is going to be problematic for proving numbers.
> At the meeting I tried to describe that I've been using io_method = sync
> instead of 'worker' to get more predictable results (together with
> echo 3 > drop_caches), because then it is that backend's CPU/$NODE doing the
> pread()/pwrite() -- or any other operation performing the load --
> and it is going to put that file onto that_specific_$NODE --
> so even if you have a sequence like:
> pgbench -i
> pg_ctl restart
> pgbench -c XX
>

Hmm, I missed that point during the meeting. I wonder if "worker" is
interacting with NUMA somehow (I mean, does it load the data into the
right node?). But I'm using io_uring, and it's not clear to me why sync
would be better for benchmarking?

Ultimately, we need to make sure it works well with io_uring anyway,
right? Even if "sync" happens to be better for benchmarking (or even for
NUMA stuff), we have to make it work with worker/io_uring. Because
that's what practical systems use.

> then pgbench -i, even with shared_buffers_numa=on, will spread the Buffers
> across many nodes, yet after the restart the VFS cache portion of the data
> will still reside on the single specific $NODE that wrote it to the filesystem
> (due to local-first-touch affinity even for VFS cache), so any further reading:
> VFS cache --pread()--> s_b will take the hit of the remote interconnect with
> some probability depending on where the new backends are running. Also
> with worker it is even worse, as we have those memory queues in between. I
> think we can even have this:
>
> file in VFS cache @ node0 --because of first touch policy (pgbench -i/prewarm)
> io worker @ node1 --hits latency from node0 and node2
> shm io worker queue @ node2 --well
> client backend @ node 3 --puts into shm io worker on node2
>
> Therefore I'm sticking to 'sync' to ease the pain... but with uring, I suspect
> the situation is kind of similar, as we call io_uring_submit(), and we
> may end up using io-wq kernel threads, and we have those submission/receive
> (memory) queues that are located somewhere (that is, on some node) too.
>
> I think we simply lack IO/NUMA affinity for all io modes except sync, but
> I suspect it's too early and way outside of scope for this $thread. I
> started thinking about it just last week, so... (but hopefully I'll be able
> to ship a helper, fscachenuma.c, next week to show the layout of a file
> across VFS caches on nodes)
>

Ah, you're suggesting the page cache stuff will be placed on a single
NUMA node? That may be true, it's a good point. And maybe it could skew
the results in a bad way. Still, that would be the case even without the
NUMA partitioning, no?

> Maybe some other suggestions:
>
> Q1) Maybe some crosschecks first?
> # balance should be equal between nodes even for baseline
> # linux kernel has tendency to fit shm into one if it fits
> find /sys/devices/system/node*/ -name 'free_hugepages' -exec grep -H . {} \;
>
> # check N0 and N1 even for default policy, might also reveal imbalance
> # lots of RAM and too big huge_pages allows fitting whole shm into just N0
> # see point 4 from [1]
> grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps
>
> # then during pgbench -c run maybe those:
> mpstat -N ALL 1
> perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
> --per-socket -I 1000 # or -M memory_bandwidth_read,memory_bandwidth_write
>
> (it might reveal that problem I've described above about io_method:
> even with pgbench -c 1 you might be reading from all sockets/wrong sockets
> instead of the correct one with affinity)
>

I'll try, but if you could try running some experiments on your own,
that might be helpful.
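
For the numa_maps cross-check quoted above, here is a rough sketch of how
the per-node page counts could be totalled instead of eyeballing the grep
output. It is only an illustration: the function name pages_per_node and the
sample lines are made up, and it assumes the usual N<node>=<pages> fields
in /proc/<pid>/numa_maps. The counts are in units of each mapping's page
size, so huge-page and regular mappings should not be mixed in one total.

```python
import re
from collections import Counter

# Matches the per-node page counts in numa_maps, e.g. "N0=512".
NODE_FIELD = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text, needle="/anon_h"):
    """Sum N<node>=<pages> fields over the lines that contain `needle`.

    `needle` plays the role of the grep pattern; the default mirrors
    "grep /anon_h" from the quoted cross-check.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        if needle not in line:
            continue
        for node, pages in NODE_FIELD.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)


# Hypothetical sample shaped like numa_maps output (not real data).
SAMPLE = """\
7f0000000000 default file=/anon_hugepage\\040(deleted) huge dirty=6 N0=4 N1=2 kernelpagesize_kB=2048
7f0040000000 default file=/anon_hugepage\\040(deleted) huge dirty=3 N0=3 kernelpagesize_kB=2048
7f0080000000 default anon=10 dirty=10 N1=10 kernelpagesize_kB=4
"""

if __name__ == "__main__":
    print(pages_per_node(SAMPLE))
```

On a real system the text would come from reading /proc/<pid>/numa_maps of
a backend; a heavily lopsided result for the baseline would point at the
"whole shm fits into N0" case.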

> I like to pin CPUs to just one node for pgbench -c <NUMBER_OF_CPUS/NUMBER_OF_NODES>
> [to saturate one node only] and start the server also with CPU pinning
> [or use this debug_numa_node to force] to that one node and cross-check
> what's being read (using perf), and usually I have to disarm clock balancing
> and override weights using pg_buffercache_set_partition() to also force
> weight to stay local only - only then I'm able to outrun master. That's
> how this idea was born that if we are only working on node $N with
> some relations then let's use only node $N's Buffers. But I have
> 90us:~280us local vs remote latency, so it's probably way easier for me
> to see results even without disabling CPU-idle-states/turboboost/etc.
>
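
For reference, a minimal sketch of what pinning to a single node could
look like from a Python driver script. It assumes Linux, where sysfs
exposes /sys/devices/system/node/node<N>/cpulist; the helper names
parse_cpulist and pin_to_node are hypothetical and not part of any patch
in this thread.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs CPU list such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_node(node, pid=0):
    """Restrict `pid` (0 = the calling process) to the CPUs of one NUMA node."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

Since CPU affinity is inherited by child processes, calling pin_to_node()
before launching pgbench or the server from the same script would keep them
on that node too. It only restricts CPUs, it does not set a memory policy.
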
> Q2) Dunno, but 0007 is not changing anything at runtime and you get a huge
> discrepancy in results when going 0006 -> 0007 for clients=1 (see
> 128% -> 112%). Would literally the same code, just a different rebuild
> (ELF image), have a layout different enough to cause perf issues?
>
> Hopefully next week I'll try to repro those numbers to see if I can
> help more.
>

Thank you! That'd be great.

regards

--
Tomas Vondra
