| From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-07-02 09:24:21 |
| Message-ID: | CAKZiRmzYM_5e8wStGQVVr7_v62br-EbUbTecK6HBFAKOGMBCfQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Jun 30, 2026 at 2:51 PM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
>
> On Mon, Jun 29, 2026 at 9:42 AM Jakub Wartak
> <jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> >
> > On Thu, Jun 25, 2026 at 3:49 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > >
> > > >> I have some results from a new round of benchmarks, and it's a bit
> > > >> disappointing. Or rather, there seem to be some issues that I can't
> > > >> figure out, causing regressions.
> > > > [..]
> > > >> This chart is for median latency (in milliseconds):
> > > >>
> > > >> clients master 0003 0004 0003/on 0004/on
> > > >> -------------------------------------------------------------
> > > >> 1 12767 12582 14509 12807 15307
> > > >> 8 14383 14355 14149 14069 16165
> > > >> 32 14756 15198 14836 14984 17128
> > > >> --------------------------------------------------------
> > > >> 1 103% 114% 100% 120%
> > > >> 8 101% 98% 98% 112%
> > > >> 32 102% 101% 102% 116%
> > > >>
>
> [..lots of variables..]
>
> > > I'll try, but if you could try running some experiments on your own,
> > > that might be helpful.
> > [..]
> > > > Hopefully next week I'll try to repro those numbers to see if I can
> > > > help more.
> > > >
> > >
> > > Thank you! That'd be great.
> >
> > Yeah, I'll try my best, we'll see how it goes. Right now I've just dropped
> > that fscachenuma proggie to aid us in troubleshooting.
> >
> > -J.
> >
> > [0] - https://github.com/jakubwartakEDB/fscachenuma
>
> Hi Tomas,
>
> OK, so I've run a couple of tests and modified run.sh and also tried to fix
> some inefficiencies spotted while testing this. Note the attached
> performance matrix is in TPS (so more is better). Raw results/CSV and
> scripts are attached too.
>
> * run2 = 2 workloads, partitioned pgbench_accounts
> * run3 = just pgbenchS w/o partitioning + warmup
> * run4 = semi-like pgbenchS w/o partitioning but 100k rows + warmup
>
[..]
>
> Still on the table:
>
> 1. maybe even better balancing is possible (?), but this one seems enough?
> I'm out of other ideas, well other than the
> "shared-relation-use-by-foreign-node" idea described much earlier (but
> I won't be able to pull that off), so I'm not entering this rabbit hole
> any deeper.
See below; this seems not to be needed (?)
> 2. Digging into io_method=worker optimizations (answering question: are they
> necessary?) Maybe I'll throw in run5 quite soon, this is going to be
> crucial to answer.
OK, I'm attaching the results from my runs 5 and 6:
- only seqconcurrscans was tested, because for the other workloads the
io_method=worker workers were not getting any load (only seq scans were
offloaded)
- checksums were disabled, because IMHO that would be an unfair comparison
(AFAIR they are offloaded)
- the optimizations from 0008, "optimized (numa=on, bal=on)", easily beat
"patched (numa=on, bal=on)" and seem to be crucial. We get about 1.2x-1.4x
across every io_method, but only with 0008.
- even when doing just logical, fully cached reads in the fully VFS-cached
case, io_uring shines (I've added raw TPS numbers to show this; compare
across tables, e.g. io_uring vs sync: 13.378/8.993=1.487x with NUMA, but
on master io_uring vs sync was just 8.79/7.389=1.189x without NUMA); it seems
io_uring is lightweight enough that the effect of remote memory latency
shows up more clearly
- there's some more juice to get out of the balancer for 0-reuse workloads
(but IMHO it's pointless to squeeze more, it's hard already)
- I was probably wrong to expect that the io_method=worker processes/queues
should get NUMA affinity. Apparently they don't need it for me to see
benefits (maybe they could get it and it would be even better, but meh).
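
For reference, the io_uring vs sync ratios quoted in the list above can be
recomputed with a few lines of Python. This is only a sketch: the four TPS
values are the ones from this message, while the speedup() helper and the
variable names are made up for illustration and are not part of
run5.sh/run6.sh.

```python
def speedup(tps_new: float, tps_base: float) -> float:
    """Return how many times faster tps_new is compared to tps_base."""
    if tps_base <= 0:
        raise ValueError("baseline TPS must be positive")
    return tps_new / tps_base


# Raw TPS values quoted in this message (seqconcurrscans, fully cached).
numa_io_uring, numa_sync = 13.378, 8.993    # with NUMA (0008)
master_io_uring, master_sync = 8.79, 7.389  # master, without NUMA

numa_ratio = speedup(numa_io_uring, numa_sync)
master_ratio = speedup(master_io_uring, master_sync)

print(f"io_uring vs sync, NUMA:   {numa_ratio:.3f}x")
print(f"io_uring vs sync, master: {master_ratio:.3f}x")
```

The printed ratios agree with the 1.487x and 1.189x above, up to rounding in
the last digit.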
So, with the io_method impact ruled out (I speculated earlier that this could
be it), this means that you were either hitting the lack of the optimizations
from 0008 or were impacted by the lack of drop_caches before the runs.
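
As a reminder of what "drop_caches before the runs" means in practice, here is
a minimal sketch in Python. It assumes Linux and root privileges, and uses the
standard /proc/sys/vm/drop_caches knob; the drop_caches() helper and its
parameters are hypothetical and only illustrate the idea, they are not taken
from the attached scripts.

```python
import os

# Standard Linux procfs knob; writing 1, 2 or 3 drops page cache,
# dentries/inodes, or both.
DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_caches(path: str = DROP_CACHES_PATH, mode: int = 3) -> None:
    """Flush dirty pages, then ask the kernel to drop clean caches."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # write back dirty pages so they become droppable
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

Calling drop_caches() as root before each benchmark run makes every run start
from the same (cold) page cache state, so the warmup decides what is cached and
not whatever was left over from the previous run.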
> Maybe You could run those run*.sh with master vs inst-patchset/optimized?
> (I'm not sure, maybe there's even different factor at play too...)
This seems to be crucial now: to double-confirm the results by load-testing
on your hw with 0008 (but that hardware really needs to have an effective
latency difference between at least 2 NUMA nodes; Intel's mlc is good for
checking this). Maybe also tweak that 125% inside 0008 to some other value
(I've got 4 nodes, so 100/4=25%).
> 3. Potentially the BAS strategies mentioned earlier (forcing use of just the
> local partitions for known-to-be-only-local users: CTAS/VACUUM/etc.), but
> I'm afraid that's not for me as I would certainly break/violate some
> boundary that is invisible to me.
And this one is still potentially on the table as a nice thing to have.
-J.
| Attachment | Content-Type | Size |
|---|---|---|
| performance_report_runs56.html | text/html | 10.8 KB |
| runs_56.csv | text/csv | 14.9 KB |
| run6.sh | application/x-shellscript | 4.7 KB |
| run5.sh | application/x-shellscript | 4.7 KB |