Re: Adding basic NUMA awareness

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-06-30 12:51:18
Message-ID: CAKZiRmz+0tYkoMAKq4Qoc6M1-ZhYFEJnJO23Q2Kf8eyhx3S4og@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 29, 2026 at 9:42 AM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
>
> On Thu, Jun 25, 2026 at 3:49 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >
> > >> I have some results from a new round of benchmarks, and it's a bit
> > >> disappointing. Or rather, there seem to be some issues that I can't
> > >> figure out, causing regressions.
> > > [..]
> > >> This chart is for median latency (in milliseconds):
> > >>
> > >> clients   master    0003    0004   0003/on   0004/on
> > >> -----------------------------------------------------
> > >>       1    12767   12582   14509     12807     15307
> > >>       8    14383   14355   14149     14069     16165
> > >>      32    14756   15198   14836     14984     17128
> > >> -----------------------------------------------------
> > >>       1             103%    114%      100%      120%
> > >>       8             101%     98%       98%      112%
> > >>      32             102%    101%      102%      116%
> > >>

[..lots of variables..]

> > I'll try, but if you could try running some experiments on your own,
> > that might be helpful.
> [..]
> > > Hopefully next week I'll try to repro those numbers to see if I can
> > > help more.
> > >
> >
> > Thank you! That'd be great.
>
> Yeah, I'll try my best, we'll see how it goes. Right now I've just dropped
> that fscachenuma proggie to aid us in troubleshooting.
>
> -J.
>
> [0] - https://github.com/jakubwartakEDB/fscachenuma

Hi Tomas,

OK, so I've run a couple of tests, modified run.sh, and also tried to fix
some inefficiencies spotted while testing this. Note that the attached
performance matrix is in TPS (so more is better). Raw results/CSV and
scripts are attached too.

* run2 = 2 workloads, partitioned pgbench_accounts
* run3 = just pgbenchS w/o partitioning + warmup
* run4 = semi-like pgbenchS w/o partitioning but 100k rows + warmup

One important modification in those run shell scripts is that they
clean the page cache (drop_caches), as mentioned earlier, to avoid false
results where everything would land on node#N after pgbench -i ran. That
is probably why I did not get the regressions you got. Or better, just
diff -u the run*.sh scripts.
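
A minimal sketch of the cache-dropping step described above, written in
Python rather than the shell used by the run*.sh scripts. The function name
and the path parameter are illustrative and not taken from the attached
scripts. Writing to /proc/sys/vm/drop_caches needs root, and only clean
pages can be dropped, hence the sync first.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"  # standard Linux procfs knob


def drop_page_cache(path=DROP_CACHES, mode=3):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    mode: 1 = page cache, 2 = dentries and inodes, 3 = both.
    `path` is a parameter only so the function can be tried without root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # dirty pages cannot be dropped, so write them back first
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Calling drop_page_cache() between pgbench -i and the warmup would be the
rough equivalent of what the scripts do before each measured run.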

The "inst-optimized" is just the same patchset (so "inst-patchset") + a
crude attempt in 0008, made while I've been working on this, to further
smooth things out and avoid regressions. 0008 does a couple of things:

a. implements CPU/node caching instead of querying it for every single
buffer. Even if on x86_64 that is optimized by vdso/kernel to avoid the real
syscall, the semi-syscall tax seems to be visible when fetching lots of
buffers. The cache interval of 128 buffers is arbitrary and still kind of
low (128*8kB=1MB, and we are doing hundreds of MB/s, while rescheduling
happened only every couple of seconds).

b1. minimizes attempts to use other partitions until some threshold (
and then it relies on the scan-all-partitions)

b2. avoids selecting idle partitions (defined as avg_allocs/2) - if there
are low allocations there it is debatable if cache utilization is better
or sticking to lower latency is better (e.g. in some workloads buffer
reuse is close to 0, so lower latency is clearly better)
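
A rough sketch of the ideas in (a) and (b2), in Python for brevity; this is
not the code from 0008. The class and function names are made up for
illustration, the node lookup is passed in as a callable (in C it would be
something like sched_getcpu() plus a CPU-to-node map), and "idle" is read
here as "fewer allocations than half the average", which is only my
interpretation of avg_allocs/2.

```python
REFRESH_EVERY = 128  # buffers between lookups, the value mentioned for 0008


class CachedNode:
    """Remember the NUMA node and re-query it only every N calls."""

    def __init__(self, lookup, refresh_every=REFRESH_EVERY):
        self._lookup = lookup            # callable returning current node id
        self._refresh_every = refresh_every
        self._remaining = 0              # 0 forces a lookup on first use
        self._node = None

    def get(self):
        if self._remaining == 0:
            self._node = self._lookup()  # the (semi-)syscall happens here
            self._remaining = self._refresh_every
        self._remaining -= 1
        return self._node


def busy_partitions(allocs):
    """Indexes of partitions with at least half the average allocations."""
    if not allocs:
        return []
    threshold = (sum(allocs) / len(allocs)) / 2
    return [i for i, n in enumerate(allocs) if n >= threshold]
```

With a refresh interval of 128, fetching 256 buffers costs two lookups
instead of 256; the price is that a backend rescheduled to another node
keeps using the stale node for up to 127 more buffers.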

Results are attached, some observations:

0. There were vast differences depending on how pg_ctl is started
(interleaved or not), so I've decided in the end to show results relative
to both situations.

1. In run2/seqconcurrscans I've saturated my interconnect and that's why
it's giving 129-155% there. I don't have access to physical hw, but I
suspect that a modern 2-socket EPYC5 has like ~614GB/s per socket RAM
bandwidth, but the max one-way bandwidth of the interconnect is around
~220GB/s (no way to prove it), so *IF* with hundreds of cores we would be
able to fetch at this rate we could saturate modern hardware too that way
(and we briefly touched on a related topic: a batched executor,
accelerating things so much that those effects could be more easily
achievable)

2. run3 has no partitioning because, according to perf and my eyes, it
spent time not on the buffers themselves (thus it was way heavier on CPU
[partitioning] than on memory...), so that's how run3 was born without
partitions :D

3. The warmup is critical for run3/pgbenchS, as I've noticed that,
depending on ${luck}, if you start the "master" (baseline w/o interleaving)
and run pgbench right away, everything might land on node0 (s_b,
pagecache), so "master" was basically cheating in benchmarks, especially vs
your patchset, where it was spreading way too soon. Having drop_caches, an
additional warmup and only then the proper pgbench kind of reduces that
luck factor. In general I think all runs with c=1 seem to have a kind of
low signal-to-noise ratio. I was thinking about pinning to always stick to
the same NUMA node from the start to win against master just for these c=1
scenarios, but "meh".

3b. In short, for pgbench -S we can gain like 2-5%

4. run4 was made just to prove that a workload fetching more buffers than
the standard pgbench -S (1 row?) seems to be the key to proving the
optimizations in 0008 (other than showing good benefits for seqconcurrscans
of course). So run4 just shows the benefit compared to 0001-0007 alone.
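
To put the guessed numbers from observation 1 in perspective, here is a
back-of-the-envelope calculation. The 614 and 220 GB/s figures are the
unverified guesses from above, not measurements, and the helper name is
made up.

```python
PAGE_BYTES = 8 * 1024      # default PostgreSQL block size (8kB)

LOCAL_GB_S = 614           # guessed per-socket RAM bandwidth
INTERCONNECT_GB_S = 220    # guessed one-way interconnect bandwidth


def pages_per_second(bandwidth_gb_s, page_bytes=PAGE_BYTES):
    """8kB pages per second needed to move `bandwidth_gb_s` (decimal GB/s)."""
    return bandwidth_gb_s * 1e9 / page_bytes


# Share of local memory bandwidth that the interconnect could carry.
ratio = INTERCONNECT_GB_S / LOCAL_GB_S

# Remote page fetch rate at which the interconnect would be saturated.
remote_pages = pages_per_second(INTERCONNECT_GB_S)

# Data covered by one cached CPU/node lookup (128 buffers).
cache_refresh_bytes = 128 * PAGE_BYTES
```

Under those assumptions the interconnect carries only about a third of
local bandwidth (ratio is roughly 0.36), and saturating it would take
roughly 27 million remote 8kB page reads per second across all backends.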

Still on the table:

1. maybe even better balancing is possible (?), but this one seems enough?
I'm out of other ideas, well other than the
"shared-relation-use-by-foreign-node" idea described much earlier (but
I won't be able to pull that off), so I'm not entering this rabbit hole
any deeper.

2. Digging into io_method=worker optimizations (answering the question: are
they necessary?). Maybe I'll throw in run5 quite soon, this is going to be
crucial to answer.

3. Potentially the BAS strategies mentioned earlier (forcing use of just
the local partitions for known-to-be-only-local users: CTAS/VACUUM/etc),
but I'm afraid that's not for me as I would certainly break/violate some
boundary that is invisible to me.

Maybe you could run those run*.sh with master vs inst-patchset/optimized?
(I'm not sure, maybe there's even a different factor at play too...)

-J.

Attachment Content-Type Size
performance_report_run2.html text/html 15.1 KB
numabenchhackersreview-2026-06-30.tgz application/x-compressed-tar 49.0 KB
performance_report_run3.html text/html 9.1 KB
v20260630-0008-0001-clock-sweep-cached-CPU-NUMA-node-and-.patch text/x-patch 6.3 KB
performance_report_run4.html text/html 9.4 KB
