| From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-06-29 07:42:49 |
| Message-ID: | CAKZiRmzigM9XOz6m0K070ZVSmaKpjnVR4CZZENkfeH-rNv78wA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Thu, Jun 25, 2026 at 3:49 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> >> I have some results from a new round of benchmarks, and it's a bit
> >> disappointing. Or rather, there seem to be some issues that I can't
> >> figure out, causing regressions.
> > [..]
> >> This chart is for median latency (in milliseconds):
> >>
> >> clients master 0003 0004 0003/on 0004/on
> >> -------------------------------------------------------------
> >> 1 12767 12582 14509 12807 15307
> >> 8 14383 14355 14149 14069 16165
> >> 32 14756 15198 14836 14984 17128
> >> --------------------------------------------------------
> >> 1 103% 114% 100% 120%
> >> 8 101% 98% 98% 112%
> >> 32 102% 101% 102% 116%
> >>
> >
> > I haven't tried it yet, however I can spot some things:
> >
> > No crystal clear idea why, but in the script I can see that you have
> > io_method = io_uring and are not dropping caches, so IMHO it is too complex
> > an interaction at this stage.
> >
>
> By caches I assume you mean page cache? The test is meant to simulate a
> cached system, copying data between shared buffers and page cache. My
> expectation is that once we start hitting I/O, it'll completely hide
> most differences due to NUMA.
No, it won't completely hide it; those differences still matter, at least here
(AFAIR right now about +/- 10%).
> > One hint: such setup is going to be problematic for proving numbers.
> > At the meeting I've tried to describe that I've been using io_method = sync
> > instead of 'worker' to get more predictable results (together with
> > echo 3 > drop_caches), because then it is that backend's CPU/$NODE doing that
> > pread()/pwrite() -- or any other operation performing the load --
> > and it is going to put that file onto that_specific_$NODE --
> > so even if you have sequence like:
> > pgbench -i
> > pg_ctl restart
> > pgbench -c XX
> >
>
> Hmm, I missed that point during the meeting. I wonder if "worker" is
> interacting with the NUMA somehow (I mean, does it load it into the
> right node?). But I'm using io_uring, and it's not clear to me why sync
> would be better for benchmarking?
>
> Ultimately, we need to make sure it works well with io_uring anyway,
> right? Even if "sync" happens to be better for benchmarking (or even for
> NUMA stuff), we have to make it work with worker/io_uring. Because
> that's what practical systems use.
Yes, we need to make it work with the more advanced I/O methods, but I don't
think we are there yet (we'll need some more patches in order to demonstrate
it reliably).
> > then pgbench -i even with shared_buffers_numa=on will spread the Buffers
> > across many nodes, yet after the restart the VFS cache portion of the data
> > will still reside on the single specific $NODE that wrote it to the filesystem
> > (due to local-first-touch affinity even for VFS cache),
> > [.. blabla , use io_method=sync ]
> >
>
> Ah, you're suggesting the page cache stuff will be placed on a single
> NUMA node? That may be true, it's a good point. And maybe it could skew
> the results in a bad way.
I've just published [0], see for yourself:
This happens especially after pgbench -i, so:
pgbench -i # pagecache placement on one NUMA node
pg_ctl restart
pgbench -c XX
is as different as day and night from, let's say:
pgbench -i
echo 3 > drop_caches
pg_ctl restart
pgbench -c XX # pagecache placement happens by many backends
# potentially many NUMA nodes
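A minimal sketch of that second sequence as a script, for illustration only. The helper names (run, cold_cache_bench) and the DRY_RUN/CLIENTS variables are hypothetical, the extra sync before dropping caches is an added assumption (drop_caches only frees clean pages), and the full /proc/sys/vm/drop_caches path plus a PGDATA/PGDATABASE environment are assumed. By default it only prints the commands:

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch set or of fscachenuma.
# Assumes PGDATA (for pg_ctl) and PGDATABASE (for pgbench) are set.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

cold_cache_bench() {
    clients="$1"
    run pgbench -i
    # Assumption: flush dirty pages first, since drop_caches skips them.
    run sync
    # Writing 3 drops page cache plus dentries/inodes; needs root.
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    run pg_ctl restart
    # Page cache is now repopulated by the backends doing the reads.
    run pgbench -c "$clients"
}

cold_cache_bench "${CLIENTS:-8}"
```

Running it with DRY_RUN=0 (as root, for the drop_caches step) would execute the steps; leaving out the sync and drop_caches lines gives the first sequence for comparison.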
> Still, that would be the case even without the NUMA partitioning, no?
Right, in my experience we should not benchmark against master started
with the default pg_ctl (that is, without numactl --interleave=all) because
it is confusing to reason about, due to how the s_b could be laid out without
that interleaving. I mean, later we can switch to that default, but IMHO not
yet.
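One way to see how s_b actually got laid out is to sum the per-node page counts from numa_maps. The following is only a sketch: node_pages is a hypothetical helper name, and it assumes the usual N<node>=<pages> fields on each numa_maps line (with huge pages the counts are, as far as I know, in units of that mapping's kernelpagesize_kB):

```shell
#!/bin/sh
# Hypothetical helper: sum the N<node>=<pages> fields of numa_maps lines
# read from stdin and print one "N<node> <total>" line per node.
node_pages() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")      # kv[1] = node name, kv[2] = pages
                pages[kv[1]] += kv[2]
            }
    }
    END {
        for (n in pages) print n, pages[n]
    }' | sort
}
```

It could be fed with something like: grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps | node_pages. Roughly equal totals per node would suggest interleaving took effect, while one dominant node would point at the imbalance described above.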
> > Maybe some other suggestions:
> >
> > Q1) Maybe some crosschecks first?
> > # balance should be equal between nodes even for baseline
> > # linux kernel has tendency to fit shm into one if it fits
> > find /sys/devices/system/node*/ -name 'free_hugepages' -exec grep -H . {} \;
> >
> > # check N0 and N1 even for default policy, might also reveal imbalance
> > # lots of RAM and too big huge_pages allows fitting whole shm into just N0
> > # see point 4 from [1]
> > grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps
> >
> > # then during pgbench -c run maybe those:
> > mpstat -N ALL 1
> > perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
> > --per-socket -I 1000 # or -M memory_bandwidth_read,memory_bandwidth_write
> >
> > (it might reveal that problem I've described above about io_method:
> > even with pgbench -c 1 you might be reading from all sockets/wrong sockets
> > instead of the correct one with affinity)
> >
>
> I'll try, but if you could try running some experiments on your own,
> that might be helpful.
[..]
> > Hopefully next week I'll try to repro those numbers to see if I can
> > help more.
> >
>
> Thank you! That'd be great.
Yeah, I'll try my best, we'll see how it goes. Right now I've just dropped
that fscachenuma proggie to aid us in troubleshooting.
-J.