| From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-06-16 08:16:00 |
| Message-ID: | CAKZiRmzo9xnJSgO4b26DTZqPuObcQ-6ncay+mOEKs9rzCkegUA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Jun 5, 2026 at 2:52 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hi,
Hi Tomas, thanks for working on this.
> Here's an updated version of the NUMA patch series, based on some recent
> discussions about this (some at pgconf.dev, but not only that),
[..]
1. 005 says:
+ * XXX We should enforce this in bufmgr.c, when initializing the partitions.
+ */
+#define MAX_BUFFER_PARTITIONS 32
but there isn't any direct check that pg_numa_get_max_node() ->
numa_max_node() does not exceed the limit allowed here. In theory I think
this could happen if ClockSweepPartitionIndex() returned
numa = numa_node_of_cpu()
on some hypothetical very high-end setup (with plenty of sub-NUMA nodes),
and that would cause out-of-bounds access to .balance[].
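A minimal sketch of the kind of check I mean, assuming the partition index
is simply the NUMA node number. Both function names are hypothetical and
not part of the patch; only MAX_BUFFER_PARTITIONS comes from 005:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BUFFER_PARTITIONS 32	/* value quoted from patch 005 */

/*
 * Hypothetical startup check. max_node is what numa_max_node() would
 * return, i.e. the highest node number, so the node count is max_node + 1.
 */
static bool
numa_nodes_fit_partitions(int max_node)
{
	return max_node >= 0 && max_node + 1 <= MAX_BUFFER_PARTITIONS;
}

/*
 * Hypothetical defensive lookup: map a node number to a valid index into
 * balance[], falling back to partition 0 when the node is out of range.
 */
static int
safe_partition_index(int node)
{
	if (node < 0 || node >= MAX_BUFFER_PARTITIONS)
		return 0;
	return node;
}
```

The first function would run once when the partitions are initialized; the
second only illustrates what a per-call guard could look like if a startup
check alone is considered insufficient.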
2. If 004 has struct ClockSweep with nextVictimBuffer, shouldn't it be
padded/aligned somehow later in BufferStrategyControl, which does
ClockSweep sweeps[FLEXIBLE_ARRAY_MEMBER];
to avoid contention/false sharing? (The comment says it should be, but it
doesn't seem to be?) Maybe the comment should be a TODO for now? I have not
quantified any potential benefit.
With pahole, after some hassle, I got:
struct ClockSweep {
slock_t clock_sweep_lock; /* 0 1 */
/* XXX 3 bytes hole, try to pack */
int32 node; /* 4 4 */
int32 firstBuffer; /* 8 4 */
int32 numBuffers; /* 12 4 */
pg_atomic_uint32 nextVictimBuffer; /* 16 4 */
uint32 completePasses; /* 20 4 */
pg_atomic_uint32 numBufferAllocs; /* 24 4 */
pg_atomic_uint32 numRequestedAllocs; /* 28 4 */
pg_atomic_uint64 numTotalAllocs; /* 32 8 */
pg_atomic_uint64 numTotalRequestedAllocs; /* 40 8 */
uint8 balance[32]; /* 48 32 */
/* size: 80, cachelines: 2, members: 11 */
/* sum members: 77, holes: 1, sum holes: 3 */
/* last cacheline: 16 bytes */
};
Maybe with a smaller MAX_BUFFER_PARTITIONS we could pack this into size=64?
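As a rough illustration of the packing idea: since balance[] starts at
offset 48, a limit of 16 partitions would make the struct exactly 64 bytes.
The sketch below uses stand-in types sized like those in the pahole output;
the names SKETCH_MAX_PARTITIONS, SKETCH_CACHE_LINE_SIZE and ClockSweepSketch
are made up, and a 64-byte cache line is an assumption:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL types, sized as in the pahole output. */
typedef unsigned char slock_t;
typedef struct { volatile uint32_t value; } pg_atomic_uint32;
typedef struct { alignas(8) volatile uint64_t value; } pg_atomic_uint64;

#define SKETCH_MAX_PARTITIONS	16	/* hypothetical smaller limit */
#define SKETCH_CACHE_LINE_SIZE	64	/* assumed cache line size */

typedef struct ClockSweepSketch
{
	/* Aligning the first member aligns every array element to a line. */
	alignas(SKETCH_CACHE_LINE_SIZE) slock_t clock_sweep_lock;
	int32_t		node;
	int32_t		firstBuffer;
	int32_t		numBuffers;
	pg_atomic_uint32 nextVictimBuffer;
	uint32_t	completePasses;
	pg_atomic_uint32 numBufferAllocs;
	pg_atomic_uint32 numRequestedAllocs;
	pg_atomic_uint64 numTotalAllocs;
	pg_atomic_uint64 numTotalRequestedAllocs;
	uint8_t		balance[SKETCH_MAX_PARTITIONS];
} ClockSweepSketch;

/* Fail the build if the layout stops fitting a single cache line. */
static_assert(offsetof(ClockSweepSketch, balance) == 48,
			  "balance[] is expected at offset 48");
static_assert(sizeof(ClockSweepSketch) == SKETCH_CACHE_LINE_SIZE,
			  "ClockSweepSketch should fill exactly one cache line");
```

With this layout each element of sweeps[] would sit on its own cache line,
at the cost of capping the number of partitions at 16.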
3. In 004 sched_getcpu() is used, with a note on how to check if it is
available. My $0.02 (maybe not that important): I have seen at least once
(on EC2?) clock_gettime() being very slow because it was not available in
the vDSO. It's usually some mix of kernel <-> arch <-> libc (not always
glibc?) compatibility matrix issue. My worry is that in StrategyGetBuffer()
-> ChooseClockSweep() -> ClockSweepPartitionIndex() -> sched_getcpu() the
call would be available but slow, which would mean paying the real syscall
price (and not just once, but per buffer). I'm also thinking about other
platforms (FreeBSD comes to mind, but I haven't checked further). The point
is: wouldn't it be cheaper to refresh that value from time to time instead?
Otherwise we risk some slow code on non-x86_64, though I'm not sure how
widespread e.g. ARM64 with NUMA is..
Or the alternative is to have a pg_test_numa tool that would measure
certain things about NUMA, including the timing of sched_getcpu() (just like
pg_test_timing does for time); at least that could explain why somebody's
system/platform is slow.
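To make the "refresh from time to time" idea concrete, here is a minimal
sketch that caches the node and re-queries it only every N calls. Everything
in it is hypothetical (CachedNode, cached_node(), NODE_REFRESH_INTERVAL), and
fake_lookup() merely stands in for the real sched_getcpu()-based lookup:

```c
#include <assert.h>

#define NODE_REFRESH_INTERVAL 1024	/* hypothetical refresh period */

typedef int (*node_lookup_fn) (void);

typedef struct CachedNode
{
	int			node;		/* node returned by the last real lookup */
	int			calls_left;	/* cached uses remaining before a refresh */
} CachedNode;

/* Return the cached node, doing a real lookup once per interval. */
static int
cached_node(CachedNode *cache, node_lookup_fn lookup)
{
	if (cache->calls_left <= 0)
	{
		cache->node = lookup();
		cache->calls_left = NODE_REFRESH_INTERVAL;
	}
	cache->calls_left--;
	return cache->node;
}

/* Stand-in for the expensive lookup; counts how often it is invoked. */
static int	fake_lookup_calls = 0;

static int
fake_lookup(void)
{
	fake_lookup_calls++;
	return 2;
}
```

The trade-off is staleness: if the scheduler migrates the backend between
refreshes, it keeps using the old node's partition until the next lookup.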
4. The patch has a problem (without the fix for #8): when the number of
available huge pages in the OS is much higher than
shared_memory_size_in_huge_pages, it will use only the first NUMA node. This
might be a problem when starting multiple DBs (they will all occupy the
first available NUMA node):
### with s_b=8GB and nr_hugepages=5000 it's OK
# find /sys/devices/system/node/ -name nr_hugepages -exec grep -H . {} \; | grep 2048 | sort
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1250
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1250
/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:1250
/sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:1250
## note the correct split below for N0/N1..
# grep huge /proc/`pgrep -f /usr/pgsql19/bin/postgres`/numa_maps
7fb1b4400000 default file=/anon_hugepage\040(deleted) huge dirty=4269 mapmax=6 N0=1250 N1=1250 N2=519 N3=1250 kernelpagesize_kB=2048
### still s_b=8GB but nr_hugepages=19000 (~37GB), it all ends up on N0=4269
# find /sys/devices/system/node/ -name nr_hugepages -exec grep -H . {} \; | grep 2048 | sort
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:4750
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:4750
/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:4750
/sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:4750
## all on N0...
# grep huge /proc/`pgrep -f /usr/pgsql19/bin/postgres`/numa_maps
7ff3a7a00000 default file=/anon_hugepage\040(deleted) huge dirty=4269 mapmax=6 N0=4269 kernelpagesize_kB=2048
I was even thinking of going to the length of adding code, at some later
date, that inspects /sys to verify that the kernel NUMA hugepages are really
distributed across the nodes as they should be (it's easy to end up on just
1 node out of many; allocating via sysctl -w <higher> and then <lower> is an
easy way to force hugepages onto just 1 node instead of many :o). I've hit
the problem multiple times, so we should bail out if we want NUMA and the
Buffer Blocks were put on just 1 node (instead of many).
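A minimal sketch of such a bail-out check, assuming the per-node page counts
(the N<i>= values from numa_maps) have already been collected into an array.
Both function names and the policy itself are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Count how many nodes hold at least one page of the mapping. */
static size_t
count_nodes_in_use(const long *pages_per_node, size_t nnodes)
{
	size_t		used = 0;

	for (size_t i = 0; i < nnodes; i++)
	{
		if (pages_per_node[i] > 0)
			used++;
	}
	return used;
}

/*
 * Hypothetical policy: on a multi-node box with NUMA requested, a mapping
 * that landed on a single node is treated as a failed layout.
 */
static int
layout_is_acceptable(const long *pages_per_node, size_t nnodes)
{
	return nnodes <= 1 || count_nodes_in_use(pages_per_node, nnodes) > 1;
}
```

Fed with the two numa_maps samples above, {1250, 1250, 519, 1250} passes and
{4269, 0, 0, 0} is rejected.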
5. In 005 the struct definition could explain more clearly the difference
between these 3: numRequestedAllocs, numTotalAllocs, numTotalRequestedAllocs,
to make it easier to read. Maybe copy the earlier descriptions there too, as
we already have:
+ * The balancing happens in intervals - it adjusts future allocations
+ * based on stats about recent allocations, namely:
+ *
+ * - numBufferAllocs - number of allocations served by a partition
+ *
+ * - numRequestedAllocs - number of allocatios requested in a partition
6. While at it, it would be helpful if we could reset the
pg_buffercache_partitions stats in some way (pg_buffercache_partitions
is very useful). Or is there a way to plug into the main pg_reset functions?
7. If I add basic error checking for mbind(), it complains a lot, as below
(interleaved with strace -ffe mbind output to show the point):
[pid 2856] mbind(0x7fd8d0e00000, 2145386496, MPOL_BIND, [], 0, MPOL_MF_MOVE) = -1 EINVAL (Invalid argument)
WARNING: mbind(): Invalid argument
WARNING: buffers descriptors for node 0 not well aligned
[0x7fd8cccf5000,0x7fd8cdcf4fc1] aligned
[0x7fd8cce00000,0x7fd8cdc00000]
[pid 2856] mbind(0x7fd8cce00000, 14680064, MPOL_BIND, [], 0, MPOL_MF_MOVE) = -1 EINVAL (Invalid argument)
WARNING: mbind(): Invalid argument
WARNING: buffers for node 1 not well aligned
[0x7fd950cf5000,0x7fd9d0cf5000] aligned
[0x7fd950e00000,0x7fd9d0c00000]
[pid 2856] mbind(0x7fd950e00000, 2145386496, MPOL_BIND, 0x5589057ded00, 1, MPOL_MF_MOVE) = -1 EINVAL (Invalid argument)
WARNING: mbind(): Invalid argument
WARNING: buffers descriptors for node 1 not well aligned
[0x7fd8cdcf5000,0x7fd8cecf4fc1] aligned
[0x7fd8cde00000,0x7fd8cec00000]
[..]
but with pg_numa.c fixed like below (node should be size):
ret = mbind(startptr, (endptr - startptr),
- MPOL_BIND, nodemask->maskp, node, MPOL_MF_MOVE);
+ MPOL_BIND, nodemask->maskp, nodemask->size, MPOL_MF_MOVE);
it doesn't report errors anymore and, surprisingly, the hugepage entries in
numa_maps change from:
7fb1b4400000 default file=/anon_hugepage\040(deleted) huge dirty=4269 mapmax=6 N0=1250 N1=1250 N2=519 N3=1250 kernelpagesize_kB=2048
to explicit "binds":
7f8540000000 default file=/anon_hugepage\040(deleted) huge dirty=25 mapmax=6 N0=25 kernelpagesize_kB=2048
7f8543200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=7 mapmax=3 N0=7 kernelpagesize_kB=2048
7f8544000000 default file=/anon_hugepage\040(deleted) huge dirty=1 mapmax=2 N0=1 kernelpagesize_kB=2048
7f8544200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=7 mapmax=2 N1=7 kernelpagesize_kB=2048
7f8545000000 default file=/anon_hugepage\040(deleted) huge dirty=1 mapmax=2 N0=1 kernelpagesize_kB=2048
7f8545200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=7 mapmax=2 N2=7 kernelpagesize_kB=2048
7f8546000000 default file=/anon_hugepage\040(deleted) huge dirty=1 mapmax=2 N0=1 kernelpagesize_kB=2048
7f8546200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=7 mapmax=2 N3=7 kernelpagesize_kB=2048
7f8547000000 default file=/anon_hugepage\040(deleted) huge dirty=1 mapmax=3 N0=1 kernelpagesize_kB=2048
7f8547200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=1023 mapmax=2 N0=1023 kernelpagesize_kB=2048
7f85c7000000 default file=/anon_hugepage\040(deleted) huge dirty=1 N0=1 kernelpagesize_kB=2048
7f85c7200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1023 N1=1023 kernelpagesize_kB=2048
7f8647000000 default file=/anon_hugepage\040(deleted) huge dirty=1 N1=1 kernelpagesize_kB=2048
7f8647200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=1023 N2=1023 kernelpagesize_kB=2048
7f86c7000000 default file=/anon_hugepage\040(deleted) huge dirty=1 N3=1 kernelpagesize_kB=2048
7f86c7200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=1023 N3=1023 kernelpagesize_kB=2048
7f8747000000 default file=/anon_hugepage\040(deleted) huge dirty=117 mapmax=6 N2=117 kernelpagesize_kB=2048
so lots of VMAs were created (it could affect performance in some way; I
think it would certainly worsen the postmaster's fork() rate for new
connections). To me it looks like there are plenty of "N[0..3]=1" entries
with "default", indicating a single huge page being somehow left out/missed
in the address calculations, but I haven't dug into this further.
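For reference, the address arithmetic behind the "aligned" ranges printed in
the warnings in this item can be sketched as below. This is only my reading
of the log (start rounded up, end rounded down to a 2MB huge page); the
helper names are made up and this is not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* 2048 kB huge pages, matching kernelpagesize_kB=2048 in numa_maps. */
#define HUGE_PAGE_SIZE ((uint64_t) 2 * 1024 * 1024)

/* Round addr up to the next multiple of align (align is a power of 2). */
static uint64_t
align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Round addr down to the previous multiple of align (a power of 2). */
static uint64_t
align_down(uint64_t addr, uint64_t align)
{
	return addr & ~(align - 1);
}
```

For the node 0 descriptors, [0x7fd8cccf5000,0x7fd8cdcf4fc1] becomes
[0x7fd8cce00000,0x7fd8cdc00000], which is 14680064 bytes, the length seen in
the corresponding mbind() call. Anything between the raw and the aligned
bounds is not covered by that call; whether this is what produces the
single-page "default" VMAs is something I have not verified.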
NOTE: the patch works even without this fix, but I believe that if mbind()
returned non-0 we cannot reliably trust that the optimized memory layout has
been deployed (I suspect it's just some kind of luck that the shm got
sharded based on the number of hugepages available).
> questions
> ---------
>
> At this point, my main question is whether there's a better way to
> partition clock-sweep and/or do the balancing of allocations between
> partitions. I believe it does work, but I have a feeling there might be
> a more elegant way to do this kind of stuff (like an established
> balancing algorithm of some sort).
8. The crux of this email, and the thing I wanted to discuss further: when
the server is started on this 4-NUMA-node box with
* numactl --cpunodebind=0 pg_ctl start # so that all fork()ed backends are on node#0
* the shm split across the 4 nodes properly
* s_b still just 8GB (with ideal split),
* DB size ~15GB with 8 pgbench partitions (and fully in VFS cache)
* pgbench -c 8 -j 8 postgres -T 20 -P 1 -f seqconcurrscans.pgb with:
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;
* mpstat correctly reports only node#0 in use
a. with the patch, NUMA GUCs on, and the two clocksweep settings at their
default (on), I'm getting:
latency average = 3252.254 ms
latency stddev = 72.011 ms
b. with debug_clocksweep_balance=off, I'm reliably getting
latency average = 2688.742 ms
latency stddev = 61.738 ms
so IMHO clocksweep partitioning is cool, but since we are discussing it: the
current balancing logic leaves some juice on the table compared to the most
optimized variant (~1.2x) on a box with ~90ns:270ns local vs remote latency.
In the scenario above it was 8 backends accessing 8x 1.6GB tables (each
smaller than NBuffers / 4).
Dunno if it should be optimized further, but we'll certainly get reports
from people running quick benchmarks that PG 20 can be *slower* because..
well, they got a (sub)optimal layout during startup (all HP on 1 node and
some query hitting just that one node with local affinity, and this is
visible to the naked eye). I was re-reading the thread and Andres also wrote
"We should use the partitioned clock sweep to default to using local memory
as long as possible."
So two further ideas:
I. BufferAccessStrategy: we could derive affinity from the BAS strategy
itself, couldn't we? If we are using a capped ring buffer, we could indicate
that we want it from the local node as a priority, disregarding weights (?).
The same goes for BAS_VACUUM (why would one want it on a remote NUMA node?).
With BAS_BULKREAD there would be a potential issue with the sync-scan-table
code, though. With BAS_BULKWRITE, e.g. CTAS/CREATE INDEX, it makes a lot of
sense too. prewarm could be hacked to use some new special BAS_DISTRIBUTE or
something, for ideal distribution among all NUMA nodes.
II. What if we could track whether a relation is accessed only locally?
Another idea: if we knew that it's just us working on some relation
(created by us, or not being touched remotely), then we could
ask for local-only memory affinity. So something like this:
a. for locally-only accessed rels =>
ask for local memory first
if that fails, fall back to weighted RR (so according to the weights)
b. for other rels => weighted RR (so according to the weights) directly
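The a/b decision above could be sketched roughly as follows. This is not the
patch's balancing code: the weighted RR here is a generic "smooth" weighted
round-robin that I picked for illustration, and has_free[] is a hypothetical
stand-in for "the local partition can still serve an allocation":

```c
#include <assert.h>
#include <stdbool.h>

#define NPARTS 4				/* four NUMA nodes, as on the test box */

typedef struct WeightedRR
{
	int			weights[NPARTS];	/* share per partition, e.g. percent */
	int			credits[NPARTS];	/* running credit per partition */
} WeightedRR;

/* Smooth weighted round-robin: pick the partition with the most credit. */
static int
weighted_rr_next(WeightedRR *rr)
{
	int			total = 0;
	int			best = -1;

	for (int i = 0; i < NPARTS; i++)
	{
		rr->credits[i] += rr->weights[i];
		total += rr->weights[i];
		if (best < 0 || rr->credits[i] > rr->credits[best])
			best = i;
	}
	rr->credits[best] -= total;
	return best;
}

/* Case a: try the local node first; otherwise (and in case b) use weights. */
static int
choose_partition(WeightedRR *rr, int local_node, bool rel_is_local_only,
				 const bool *has_free)
{
	assert(local_node >= 0 && local_node < NPARTS);

	if (rel_is_local_only && has_free[local_node])
		return local_node;
	return weighted_rr_next(rr);
}
```

With equal weights the fallback cycles through partitions 0, 1, 2, 3; with
weights of {100,0,0,0} it always returns partition 0.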
Tracking whether a Buffer was accessed just locally or also remotely
is not hard to imagine (by using some free "bits" in BufferDesc.
"state", where refcount/usagecount are stored; at least 4 bits
for my 4 nodes, but there are plenty left there). I have a PoC for
that, but it's just per-Buffer tracking of "was this Buffer accessed by
remote nodes", and I'm completely lost on how to make the transition to the
is-the-relation-being-accessed-across-NUMA-nodes info needed to drive such
an optimization (we would need some shared infra just for tracking such
info; assuming up to 2^31 or 2^32 relations [OID?] and using at least
2..4 bits each, that's already a huge number: we are talking GBs of shm mem).
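Just to illustrate the per-Buffer part, a minimal sketch with C11 atomics on
a standalone 32-bit word. The bit positions are made up and do not reflect
the real BufferDesc state layout, and this is not my PoC, only the general
shape of it:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NODE_BITS_SHIFT 0		/* hypothetical position of the free bits */
#define NODE_BITS_COUNT 4		/* one bit per node on a 4-node box */
#define NODE_BITS_MASK \
	(((UINT32_C(1) << NODE_BITS_COUNT) - 1) << NODE_BITS_SHIFT)

/* Record that a backend running on 'node' touched the buffer. */
static void
mark_accessed_from(_Atomic uint32_t *state, int node)
{
	assert(node >= 0 && node < NODE_BITS_COUNT);
	atomic_fetch_or(state, UINT32_C(1) << (NODE_BITS_SHIFT + node));
}

/* True if the buffer was touched, and only ever from 'node'. */
static bool
accessed_only_from(_Atomic uint32_t *state, int node)
{
	uint32_t	bits = atomic_load(state) & NODE_BITS_MASK;

	return bits == (UINT32_C(1) << (NODE_BITS_SHIFT + node));
}
```

This gives the "was this Buffer accessed by remote nodes" answer per buffer,
but it says nothing about the relation as a whole, which is the missing
piece described above.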
BTW: I've been experimenting with this patchset and added a couple of things
(see attached), and with them I'm able to get optimal results just by
forcing affinity, using that earlier benchmark:
latency average = 2512.929 ms
latency stddev = 97.525 ms
and that was with pure 100% affinity to my local node:
select pg_buffercache_set_partition(0, '{100,0,0,0}');
debug_clocksweep_balance_recalc=off
debug_clocksweep_balance=on
debug_clocksweep_scan_all_partitions=on
(so it's more proof that the code is fine; it's just the algorithm that
would have to be adjusted)
For benchmarks with pgbench -S for 100% local affinity vs 100% remote
(I can do that with that pg_buffercache_set_partition() of mine), I'm
getting just +/- 1-2k TPS (42-43k TPS vs 41-42k TPS), so not much, __but__
I've spotted another bug where we fetch memory from suboptimal places if we
are not on node#0; I'll need to dig into that more, though. Another thing is
that pgbench -S runs are much less demanding in terms of memory bandwidth
(<1GB/s here vs 6-8GB/s for seqconcurrscans.sql using the same number of
cores).
> The other thing I need to verify is how this behaves with
> kernel.nr_hugepages. With some earlier versions it was easy to end in a
> situation where everything seemed to work, but then much later the
> kernel realized it does not have enough huge pages on a particular NUMA
> node and crashed with a segfault (or was it sigbus?).
It was SIGBUS and with this patchset I think we are fine: I have never
witnessed this one crashing with SIGBUS.
> Of course, the other question is performance validation - does it even
> help? I plan to repeat the various experiments mentioned in this thread
> (by Andres and others) on available NUMA machines. But if someone has an
> idea for another benchmark (and/or what metric to measure, not just the
> usual duration), let me know.
See above, but I think we would have to fix at least the mbind() failure
and those disconnected VMA regions.
-J.
| Attachment | Content-Type | Size |
|---|---|---|
| vXXX1-0001-Add-parttioned-clocksweep-and-NUMA-goodies.cfbotignorepatch | application/octet-stream | 14.6 KB |
| mbind_check_errcode.cfbotignorepatch | application/octet-stream | 871 bytes |