Re: Adding basic NUMA awareness

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-01-13 14:37:15
Message-ID: f2t6fd4sjhgp4afc6ifeu6mw3zzwjvaajfgfl4msc7jg3ome2w@fid7saha67bv
Lists: pgsql-hackers

Hi,

On 2026-01-13 14:26:37 +0100, Jakub Wartak wrote:
> - so per above and in my opinion, both on master or all patchsets
> here, classic OLTP pgbench (-S) is way too CPU computation heavy even
> with -M prepared to see NUMA latency effects.

I don't think it's that it's too CPU computation heavy, it's that it's very
latency sensitive to a small number of cachelines (ProcArrayLock, buffer
mapping table locks, btree inner pages), which will fundamentally have to
reside on one of the nodes. For pgbench -S to benefit we'd first need to
address at the very least the btree root page contention.
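
(If you want to see that contention directly, one rough way, assuming a perf
build with c2c support, is to sample cacheline contention machine-wide while
pgbench -S is running:

# record shared-cacheline (HITM) traffic for 30 seconds
perf c2c record -a -- sleep 30
# then rank the most contended cachelines
perf c2c report

That should make the hot cachelines visible without having to guess.)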

> - the single seqscan "SELECT sum(abalance) FROM pgbench_accounts;"
> problem (or lack of it -- with single session) is that with standard
> master, you may end up having data on just a single NUMA node. If that
> system is idle and running just 1 instance of this query, I'm getting
> like ~750MB/s of memory reads from the socket where the data is
> located (again , much below the limit of the interconnect
> [3.8-4.2GB/s])

The limit of the interconnect should be pretty much irrelevant for a single
query, you're never going to hit that limit with just one query.

On my ~8 year old workstation (2x Xeon Gold 5215), with slow-ish RAM:

./mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.11a
Command line parameters: --bandwidth_matrix

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0         1
       0       65173.3   34092.5
       1       33726.3   71244.1

If you're seeing ~3.5GB/s, as your [2] indicates, something is either wrong
with that system, or it's so old that it's useless for benchmarking. That's
worse than the single-core node-to-node numbers I've gotten on 10 year old
hardware.
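
FWIW, mlc can also give you the latency equivalent of that matrix, which is
the number that actually matters for this discussion:

./mlc --latency_matrix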

The reason you're only getting ~750MB/s is presumably that you're *latency*
limited, not bandwidth limited. The problem is that our deforming code
currently has an unpredictable memory access at the start (the one that
determines the address of the first column to actually deform), and that
access cannot meaningfully be hidden by out-of-order or speculative
execution.
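
As a very rough back-of-the-envelope check (assuming something like 100-150ns
for a remote DRAM access, and one 64 byte cacheline per dependent access that
can't be overlapped):

64 bytes / 120 ns ~ 0.5 GB/s

which is in the same ballpark as the ~750MB/s you're seeing - the real
workload can overlap some of the accesses, so it ends up somewhat higher.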

> - however the above might be simply not true on single-socket NUMA
> systems (EPYC)

EPYC supports both single and dual socket systems. And Intel has
NUMA-within-a-socket too...
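
If in doubt about what a particular box actually exposes, numactl shows the
node layout and the relative distances:

numactl --hardware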

> or just more modern multi-socket (but still same chassis NUMA systems) - so
> EPYC again?(~500GB/s wild interconnect)?

I don't think EPYC, even in the newer iterations, has anywhere near a 500GB/s
interconnect. But it's really irrelevant, latency is the main factor, not
bandwidth.

> So technically I should be getting this ~7%..22% profit due to lower
> latency if I would be fetching just ONLY local memory (but with NUMA
> we are not doing it right? we are interleaving - so we hit all sockets
> most of the time to fetch data)

We should *not* be interleaving unnecessarily, precisely because of this. We
should use the partitioned clock sweep to default to using local memory as
long as possible.
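
One crude way to verify that is to look at where the shared memory pages
actually ended up, e.g. via

numastat -p <postmaster pid>

(substitute the actual pid) or the N<node>= fields in
/proc/<postmaster pid>/numa_maps - with local allocation those counts should
stay skewed towards the local node until more memory is needed than the node
has.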

Greetings,

Andres Freund
