| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-06-16 12:39:45 |
| Message-ID: | 86e7e218-9013-4cdb-9ed2-dfda49640b4d@vondra.me |
| Lists: | pgsql-hackers |
On 6/16/26 10:16, Jakub Wartak wrote:
> On Fri, Jun 5, 2026 at 2:52 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> Hi,
>
> Hi Tomas, thanks for working on this.
>
>> Here's an updated version of the NUMA patch series, based on some recent
>> discussions about this (some at pgconf.dev, but not only that),
> [..]
>
> 1. 005 says:
>
> + * XXX We should enforce this in bufmgr.c, when initializing the partitions.
> + */
> +#define MAX_BUFFER_PARTITIONS 32
>
> but there isn't any direct check that pg_numa_get_max_node() ->
> numa_max_node() does not return more than is allowed here. In theory this
> could happen, I think, if ClockSweepPartitionIndex() returned
> numa = numa_node_of_cpu()
> on some hypothetical very high-end setup (with plenty of sub-NUMA nodes),
> and that would cause out-of-bounds access to .balance[].
>
Yes, this should be capped to the MAX_BUFFER_PARTITIONS.
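
A minimal sketch of what such a cap could look like, assuming the partition count is derived from the highest node number. The helper names are made up for illustration and are not from the patch; wrapping high node numbers with a modulo is just one possible policy.

```c
#define MAX_BUFFER_PARTITIONS 32

/*
 * Hypothetical helper: derive the number of partitions from the highest
 * NUMA node number (as numa_max_node() would report it), capped at
 * MAX_BUFFER_PARTITIONS.
 */
static int
demo_capped_num_partitions(int max_node)
{
    int n = max_node + 1;

    if (n < 1)
        n = 1;
    if (n > MAX_BUFFER_PARTITIONS)
        n = MAX_BUFFER_PARTITIONS;
    return n;
}

/*
 * Hypothetical helper: map a node number (e.g. from numa_node_of_cpu())
 * to a partition index that always stays within the capped range, so it
 * is safe to use as an index into balance[].
 */
static int
demo_partition_for_node(int node, int num_partitions)
{
    if (node < 0)
        return 0;               /* lookup failed, fall back to partition 0 */
    return node % num_partitions;
}
```

With this, a node number above the cap still maps to a valid partition instead of indexing past the end of the array.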
> 2. If we have in 004 struct ClockSweep with nextVictimBuffer, shouldn't
> this be padded/aligned somehow later in BufferStrategyControl which does
> ClockSweep sweeps[FLEXIBLE_ARRAY_MEMBER];
> to avoid contention/false sharing? (the comment says it should be, but it
> doesn't seem so?) Maybe the comment should be a TODO for now? I have not
> quantified any potential benefit.
>
> With pahole after some hassle I've got:
> struct ClockSweep {
> slock_t clock_sweep_lock; /* 0 1 */
>
> /* XXX 3 bytes hole, try to pack */
>
> int32 node; /* 4 4 */
> int32 firstBuffer; /* 8 4 */
> int32 numBuffers; /* 12 4 */
> pg_atomic_uint32 nextVictimBuffer; /* 16 4 */
> uint32 completePasses; /* 20 4 */
> pg_atomic_uint32 numBufferAllocs; /* 24 4 */
> pg_atomic_uint32 numRequestedAllocs; /* 28 4 */
> pg_atomic_uint64 numTotalAllocs; /* 32 8 */
> pg_atomic_uint64 numTotalRequestedAllocs; /* 40 8 */
> uint8 balance[32]; /* 48 32 */
>
> /* size: 80, cachelines: 2, members: 11 */
> /* sum members: 77, holes: 1, sum holes: 3 */
> /* last cacheline: 16 bytes */
> };
> maybe with smaller MAX_BUFFER_PARTITIONS we could pack this into size=64 ?
>
Possibly. I'm not entirely happy with making the ClockSweep struct so
much larger, but I haven't found a better way to store the counters
needed for balancing. The only thing I can think of is storing it
outside the struct, and maybe that's the right thing to do.
But that assumes the current balancing approach is the right one.
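
To make the two ideas concrete (padding each entry, and keeping the balance counters outside the struct), here is a rough sketch. All the Demo* names and the 64-byte line size are assumptions for illustration; PostgreSQL has its own PG_CACHE_LINE_SIZE constant, and the real struct has more fields.

```c
#include <stdint.h>

#define DEMO_CACHE_LINE_SIZE 64     /* assumption for this sketch */
#define DEMO_MAX_PARTITIONS  32

/* Hot per-partition fields only; the balance weights are kept elsewhere. */
typedef struct DemoClockSweep
{
    int32_t     node;
    int32_t     firstBuffer;
    int32_t     numBuffers;
    uint32_t    nextVictimBuffer;
    uint32_t    completePasses;
    uint32_t    numBufferAllocs;
    uint32_t    numRequestedAllocs;
} DemoClockSweep;

/* Pad each entry to a full cache line, so array neighbors never share one. */
typedef union DemoClockSweepPadded
{
    DemoClockSweep sweep;
    char        pad[DEMO_CACHE_LINE_SIZE];
} DemoClockSweepPadded;

/* Balance weights stored out of line, one row per partition. */
typedef struct DemoBalanceRow
{
    uint8_t     weight[DEMO_MAX_PARTITIONS];
} DemoBalanceRow;

_Static_assert(sizeof(DemoClockSweep) <= DEMO_CACHE_LINE_SIZE,
               "hot fields must fit into one cache line");
_Static_assert(sizeof(DemoClockSweepPadded) == DEMO_CACHE_LINE_SIZE,
               "padded entry must be exactly one cache line");
```

The padding only helps if the array itself starts on a cache-line boundary, which this sketch does not show.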
> 3. In 004 sched_getcpu() is used and mentioned how to check if it is available
>
> My $0.02 (maybe not that important): I've seen at least once (on EC2?) a
> clock_gettime() that was very slow because it was not available in the
> vDSO. It's usually some mix of kernel <-> arch <-> libc (not always glibc?)
> compatibility matrix issue. My worry is that in StrategyGetBuffer()
> -> ChooseClockSweep() -> ClockSweepPartitionIndex() -> sched_getcpu() the
> call would be available but slow, and it would mean paying a real syscall
> price (and not just once, but per buffer). I'm also thinking about other
> platforms (FreeBSD comes to mind, but I haven't checked further). The point
> is: wouldn't it be cheaper to refresh that value from time to time instead?
> Otherwise we risk some slow code on non-x86_64, though I'm not sure how
> widespread e.g. ARM64 with NUMA is.
> An alternative is to have a pg_test_numa program that would measure
> certain things about NUMA, including the timing of sched_getcpu() (just like
> pg_test_timing does for time); at least that could explain why somebody's
> system/platform is slow.
>
Yes, I think we may need some sort of caching for this / check only
sometimes. I haven't seen it to matter, but that may be luck and on
other systems / platforms it may be worse.
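
Something like the following is what I mean by caching, as a sketch only. The lookup is passed in as a function pointer so the example does not depend on sched_getcpu() being available; the names and the refresh interval are made up, and a real version would also have to decide how stale a cached value may be.

```c
#define DEMO_NODE_REFRESH_INTERVAL 1024     /* arbitrary, for illustration */

typedef int (*demo_node_lookup_fn) (void);

typedef struct DemoCachedNode
{
    int         node;           /* last node returned by the lookup */
    unsigned    calls_left;     /* calls remaining before the next refresh */
} DemoCachedNode;

/* Stand-in for an expensive lookup (e.g. one based on sched_getcpu()). */
static int  demo_lookup_calls = 0;

static int
demo_lookup(void)
{
    demo_lookup_calls++;
    return 0;
}

/*
 * Return the cached node, calling the real lookup only once every
 * DEMO_NODE_REFRESH_INTERVAL calls. A zero-initialized cache triggers
 * a lookup on the first call.
 */
static int
demo_cached_node(DemoCachedNode *cache, demo_node_lookup_fn lookup)
{
    if (cache->calls_left == 0)
    {
        cache->node = lookup();
        cache->calls_left = DEMO_NODE_REFRESH_INTERVAL;
    }
    cache->calls_left--;
    return cache->node;
}
```

The cost is that a backend migrated to another node keeps using the old partition until the next refresh.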
> 4. The patch has a problem (without the fix for #8): when the number of
> available huge pages in the OS is much higher than
> shared_memory_size_in_huge_pages, it will use only the first NUMA node. This
> might be a problem when starting multiple DBs (they will occupy the first
> available NUMA node):
>
> ### with s_b=8GB and nr_hugepages=1500 it's OK
>
> # find /sys/devices/system/node/ -name nr_hugepages -exec grep -H . {} \; | grep 2048 | sort
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1250
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1250
> /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:1250
> /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:1250
>
> ## note the correct split below for N0/N1..
> # grep huge /proc/`pgrep -f /usr/pgsql19/bin/postgres`/numa_maps
> 7fb1b4400000 default file=/anon_hugepage\040(deleted) huge dirty=4269
> mapmax=6 N0=1250 N1=1250 N2=519 N3=1250 kernelpagesize_kB=2048
>
> ### still s_b=8GB but nr_hugepages = 19000 (~37GB), it ends all on N0=4269
> # find /sys/devices/system/node/ -name nr_hugepages -exec grep -H . {} \; | grep 2048 | sort
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:4750
> /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:4750
> /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:4750
> /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:4750
> ## all on N0...
> # grep huge /proc/`pgrep -f /usr/pgsql19/bin/postgres`/numa_maps
> 7ff3a7a00000 default file=/anon_hugepage\040(deleted) huge dirty=4269
> mapmax=6 N0=4269 kernelpagesize_kB=2048
>
> I was even thinking of going to the length of adding code, at some later
> date, that inspects /sys to verify that the kernel's NUMA hugepages are
> really distributed across the nodes as they should be (it's easy to end up
> on just 1 node out of many; allocating via sysctl -w <higher> and then
> <lower> is an easy way to force hugepages onto just 1 node instead of
> many :o). I've hit the problem multiple times, so we should bail out if we
> want NUMA and the Buffer Blocks were put on just 1 node (instead of many).
>
How come the pg_numa_bind_to_node() calls don't move the parts to the
correct node?
If something is already using huge pages on the other nodes, then sure,
it will fail. But I think that's OK - it's a best-effort thing. Maybe we
should exit instead in this case?
> 5. In 005 we could explain more clearly the difference between these 3
> (numRequestedAllocs, numTotalAllocs, numTotalRequestedAllocs) in the
> definition to make it easier to read, maybe copying the earlier
> descriptions there too, as we already have:
>
> + * The balancing happens in intervals - it adjusts future allocations
> + * based on stats about recent allocations, namely:
> + *
> + * - numBufferAllocs - number of allocations served by a partition
> + *
> + * - numRequestedAllocs - number of allocatios requested in a partition
>
I agree the explanation for this is not entirely clear.
> 6. While at it, it would be helpful if we could reset the
> pg_buffercache_partitions stats in some way (pg_buffercache_partitions
> is very useful).. or is there a way to plug into the main pg_reset functions?
>
Hmm, I was afraid it'd interfere with the balancing. I'm not sure it
makes sense to reset just some of the fields - it'd make it much harder
to interpret the counters. I'll think about this.
> 7. If I add basic error checking for mbind() then it complains a lot, like
> below with annotated strace -ffe mbind to show the point:
>
> [pid 2856] mbind(0x7fd8d0e00000, 2145386496, MPOL_BIND, [], 0,
> MPOL_MF_MOVE) = -1 EINVAL (Invalid argument)
> WARNING: mbind(): Invalid argument
> WARNING: buffers descriptors for node 0 not well aligned
> [0x7fd8cccf5000,0x7fd8cdcf4fc1] aligned
> [0x7fd8cce00000,0x7fd8cdc00000]
>
> [pid 2856] mbind(0x7fd8cce00000, 14680064, MPOL_BIND, [], 0,
> MPOL_MF_MOVE) = -1 EINVAL (Invalid argument)
> WARNING: mbind(): Invalid argument
> WARNING: buffers for node 1 not well aligned
> [0x7fd950cf5000,0x7fd9d0cf5000] aligned
> [0x7fd950e00000,0x7fd9d0c00000]
>
> [pid 2856] mbind(0x7fd950e00000, 2145386496, MPOL_BIND,
> 0x5589057ded00, 1, MPOL_MF_MOVE) = -1 EINVAL (Invalid argument)
> WARNING: mbind(): Invalid argument
> WARNING: buffers descriptors for node 1 not well aligned
> [0x7fd8cdcf5000,0x7fd8cecf4fc1] aligned
> [0x7fd8cde00000,0x7fd8cec00000]
> [..]
>
> but with pg_numa.c fixed like below (node should be size):
> ret = mbind(startptr, (endptr - startptr),
> - MPOL_BIND, nodemask->maskp, node, MPOL_MF_MOVE);
> + MPOL_BIND, nodemask->maskp,
> nodemask->size, MPOL_MF_MOVE);
>
I think this is a silly bug on my side, clearly the nodemask should have
size > 0 even for node 0.
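
For illustration, a sketch of building a single-node mask together with a bit count that covers it. This is not the libnuma API, just plain bit arithmetic with made-up names; the point is only that the count passed next to the mask must describe the size of the mask, not the node number (which is 0 for node 0, matching the "[], 0" arguments in the strace output above).

```c
#include <limits.h>
#include <string.h>

#define DEMO_MASK_WORDS     2
#define DEMO_BITS_PER_WORD  ((int) (sizeof(unsigned long) * CHAR_BIT))

/*
 * Hypothetical helper: clear the mask, set the bit for "node", and return
 * the number of bits the mask can hold (the value to pass as the bit
 * count). Returns 0 if the node does not fit into the mask.
 */
static unsigned long
demo_single_node_mask(int node, unsigned long *mask)
{
    int capacity = DEMO_MASK_WORDS * DEMO_BITS_PER_WORD;

    memset(mask, 0, DEMO_MASK_WORDS * sizeof(unsigned long));
    if (node < 0 || node >= capacity)
        return 0;

    mask[node / DEMO_BITS_PER_WORD] |= 1UL << (node % DEMO_BITS_PER_WORD);
    return (unsigned long) capacity;
}
```

The returned count is the same for every node, which is the property the nodemask->size fix relies on.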
> it doesn't report errors anymore and, surprisingly, the hugepages in
> numa_maps change from:
> 7fb1b4400000 default file=/anon_hugepage\040(deleted) huge dirty=4269
> mapmax=6 N0=1250 N1=1250 N2=519 N3=1250 kernelpagesize_kB=2048
>
> to explicit "binds":
> 7f8540000000 default file=/anon_hugepage\040(deleted) huge dirty=25
> mapmax=6 N0=25 kernelpagesize_kB=2048
> 7f8543200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=7
> mapmax=3 N0=7 kernelpagesize_kB=2048
> 7f8544000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> mapmax=2 N0=1 kernelpagesize_kB=2048
> 7f8544200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=7
> mapmax=2 N1=7 kernelpagesize_kB=2048
> 7f8545000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> mapmax=2 N0=1 kernelpagesize_kB=2048
> 7f8545200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=7
> mapmax=2 N2=7 kernelpagesize_kB=2048
> 7f8546000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> mapmax=2 N0=1 kernelpagesize_kB=2048
> 7f8546200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=7
> mapmax=2 N3=7 kernelpagesize_kB=2048
> 7f8547000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> mapmax=3 N0=1 kernelpagesize_kB=2048
> 7f8547200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=1023
> mapmax=2 N0=1023 kernelpagesize_kB=2048
> 7f85c7000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> N0=1 kernelpagesize_kB=2048
> 7f85c7200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1023
> N1=1023 kernelpagesize_kB=2048
> 7f8647000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> N1=1 kernelpagesize_kB=2048
> 7f8647200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=1023
> N2=1023 kernelpagesize_kB=2048
> 7f86c7000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> N3=1 kernelpagesize_kB=2048
> 7f86c7200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=1023
> N3=1023 kernelpagesize_kB=2048
> 7f8747000000 default file=/anon_hugepage\040(deleted) huge dirty=117
> mapmax=6 N2=117 kernelpagesize_kB=2048
>
> so lots of VMAs were created (this could affect performance in some way; I
> think it would certainly worsen the postmaster's fork() rate for new conns).
>
Isn't this 4 VMAs for buffers and 4 VMAs for buffer descriptors? Or maybe
it's the parts that could not be mapped due to insufficient alignment?
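
To show where such leftover pieces can come from, here is the alignment arithmetic as a sketch, assuming 2MB huge pages and made-up helper names. The start of a range is rounded up and the end rounded down to a huge page boundary, so whatever lies before and after the aligned part keeps the default policy.

```c
#include <stdint.h>

#define DEMO_HUGE_PAGE_SIZE ((uint64_t) 2 * 1024 * 1024)    /* 2MB, assumed */

/* Round an address up to the next multiple of "align" (a power of two). */
static uint64_t
demo_align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Round an address down to the previous multiple of "align". */
static uint64_t
demo_align_down(uint64_t addr, uint64_t align)
{
    return addr & ~(align - 1);
}

/* Bytes at the start of [start, end) that fall outside the aligned part. */
static uint64_t
demo_unaligned_head(uint64_t start)
{
    return demo_align_up(start, DEMO_HUGE_PAGE_SIZE) - start;
}

/* Bytes at the end of [start, end) that fall outside the aligned part. */
static uint64_t
demo_unaligned_tail(uint64_t end)
{
    return end - demo_align_down(end, DEMO_HUGE_PAGE_SIZE);
}
```

Plugging in the range 0x7fd8cccf5000..0x7fd8cdcf4fc1 from the first warning gives 0x7fd8cce00000..0x7fd8cdc00000, the aligned range printed there, with a length of 14680064 bytes (the length in one of the mbind() calls).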
> To me it looks like there's plenty of "N[0..3]=1" with "default" indicating
> single HP page being somehow left/missed in address calculations, but I
> haven't pressed this harder.
>
> NOTE: the patch works even without this fix, but I believe if we got non-0 we
> cannot reliably trust that the optimized memory layout has been deployed (I
> suspect it's some kind of luck that it sharded the shm based on the number of
> hugepages available)
>
I'm not sure I understand what you mean. What non-0?
>
>> questions
>> ---------
>>
>> At this point, my main question is whether there's a better way to
>> partition clock-sweep and/or do the balancing of allocations between
>> partitions. I believe it does work, but I have a feeling there might be
>> a more elegant way to do this kind of stuff (like an established
>> balancing algorithm of some sort).
>
>
> 8. The crux of this email and the stuff I wanted to discuss further: the
> server is started on this 4-NUMA box with
> * numactl --cpunodebind=0 pg_ctl start # so that all backends fork()ing will be
> on node#0
> * the shm split onto 4 nodes properly
> * s_b still just 8GB (with ideal split),
> * DB size ~15GB with 8 pgbench partitions (and fully in VFS cache)
> * pgbench -c 8 -j 8 postgres -T 20 -P 1 -f seqconcurrscans.pgb with:
> \set num (:client_id % 8) + 1
> select sum(octet_length(filler)) from pgbench_accounts_:num;
> * mpstat correctly reports just node#0 used
>
> a. with the patch for GUCs with numa on and defaults two clocksweep settings
> on, I'm getting:
>
> latency average = 3252.254 ms
> latency stddev = 72.011 ms
>
> b. with debug_clocksweep_balance=off, I'm reliably getting
>
> latency average = 2688.742 ms
> latency stddev = 61.738 ms
>
> so IMHO clocksweep partitioning is cool, but since we are discussing it: the
> current balancing logic leaves some juice on the table compared to the most
> optimized variant (~1.2x), with ~90ns:270ns (local vs remote latency). In the
> picture above it was 8 backends accessing 8x 1.6GB tables (lower than
> NBuffers / 4).
>
How does this compare to master, i.e. without these NUMA patches?
> Dunno if it should be optimized further, certainly we'll get reports from
> quick benchmarks run by people that PG 20 could be *slower* because.. well,
> they got (sub)optimal layout during startup (all HP on 1 node and some
> query hitting just that one query with local affinity and this is visible
> to naked eye). I was re-reading thread and Andres also wrote "We should use
> the partitioned clock sweep to default to using local memory as long as
> possible."
>
> So two further ideas:
>
> I. BufferAccessStrategy: we could derive affinity from the BAS strategy
> itself, couldn't we? If we are using a capped ring buffer, we could indicate
> that we want it just from the local node as a priority, disregarding weights (?).
> Same goes for BAS_VACUUM (why would one want it on a remote NUMA node?). With
> BAS_BULKREAD there would be some potential issue with the sync-scan-table
> code though. With BAS_BULKWRITE, e.g. CTAS/CREATE INDEX, it makes a lot of
> sense too.
> prewarm could be hacked to use some new special BAS_DISTRIBUTE or something
> for ideal distribution amongst all NUMA nodes.
>
Yes, I think it might make sense to disable balancing in these cases.
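
Roughly like this, as a sketch: ring-style strategies always take the local partition, everything else goes through the weights. The enum, the function, and the "ticket" argument (standing in for whatever counter or random value drives the weighted choice) are all hypothetical and not part of the patch.

```c
/* Hypothetical strategy classes, only for this sketch. */
typedef enum DemoStrategy
{
    DEMO_STRATEGY_NORMAL,       /* regular access, balanced by weights */
    DEMO_STRATEGY_RING          /* ring buffer (vacuum, bulk read/write) */
} DemoStrategy;

/*
 * Pick a partition. Ring strategies skip balancing and stay local.
 * Otherwise "ticket" selects a partition in proportion to the weights.
 */
static int
demo_choose_partition(DemoStrategy strategy, int local_partition,
                      const unsigned char *weights, int num_partitions,
                      unsigned ticket)
{
    unsigned total = 0;

    if (strategy == DEMO_STRATEGY_RING)
        return local_partition;

    for (int i = 0; i < num_partitions; i++)
        total += weights[i];
    if (total == 0)
        return local_partition; /* no weights set, stay local */

    ticket %= total;
    for (int i = 0; i < num_partitions; i++)
    {
        if (ticket < weights[i])
            return i;
        ticket -= weights[i];
    }
    return local_partition;     /* not reached when total > 0 */
}
```

With weights {100,0,0,0} every ticket lands in partition 0, which corresponds to the forced local affinity setting from the benchmark quoted further down.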
> II. what if we could track if the relation is just all-local-access?
>
> Another idea is that if we knew that it's just us working on some
> relation (created by us, or not being touched remotely) then we could
> ask for local-only memory affinity. So something like this:
>
> a. in case of locally-only accessed rels =>
> ask for local memory first
> if that fails, fall back to weighted RR (so according to weights)
> b. in case of other rels => weighted RR (so according to weights) directly
>
> Tracking whether a Buffer was accessed just locally or remotely
> is not hard to imagine (by using some free "bits" in BufferDesc.
> "state" where refcount/usagecount themselves are stored, well at least 4 bits
> for my 4 nodes, but there's plenty left there). I have some PoC for
> that, but that's just per-Buffer tracking of "was this Buffer accessed by
> remote nodes", and I'm completely lost on how to make the transition to the
> is-the-relation-being-accessed-across-NUMA-nodes info to drive such an
> optimization (we would need some shared infra just for tracking such info;
> assuming up to 2^31 or 2^32 relations [OID?] and using at least
> 2..4 bits each, that's already a huge number: we are talking GBs of shm mem).
>
> BTW: I've been experimenting with this patchset and added a couple of things
> (see attached), and with them I'm able to get optimal results just by forcing
> affinity too, using that earlier bench:
>
> latency average = 2512.929 ms
> latency stddev = 97.525 ms
>
> and that was with pure 100% affinity to my local node:
>
> select pg_buffercache_set_partition(0, '{100,0,0,0}');
> debug_clocksweep_balance_recalc=off
> debug_clocksweep_balance=on
> debug_clocksweep_scan_all_partitions=on
>
> (so it's another proof that the code is fine, it's just the algorithm that
> would have to be adjusted)
>
> For benchmarks with pgbench -S for 100% local affinity vs 100% remote
> (I can do that with that pg_buffercache_set_partition() of mine), I'm
> getting just +/- 1-2k TPS (42-43k TPS vs 41-42k TPS), so not much, __but__
> I've spotted another bug where we are fetching memory from
> suboptimal places if we are not on node#0, I'll need to dig into that more
> though. Another thing is that pgbench -S runs are much less demanding in
> terms of memory bandwidth used (<1GB/s here vs 6-8GB/s for
> seqconcurrscans.sql using the same amount of cores).
>
No opinion. I need to look at this closer.
>> The other thing I need to verify is how this behaves with
>> kernel.nr_hugepages. With some earlier versions it was easy to end in a
>> situation where everything seemed to work, but then much later the
>> kernel realized it does not have enough huge pages on a particular NUMA
>> node and crashed with a segfault (or was it sigbus?).
>
> It was SIGBUS and with this patchset I think we are fine: I have never
> witnessed this one crashing with SIGBUS.
>
Good. But I wonder if allocating just the precise number of huge pages
(per shared_memory_size_in_huge_pages) can prevent moving the partitions
to the correct node.
>> Of course, the other question is performance validation - does it even
>> help? I plan to repeat the various experiments mentioned in this thread
>> (by Andres and others) on available NUMA machines. But if someone has an
>> idea for another benchmark (and/or what metric to measure, not just the
>> usual duration), let me know.
>
> See above, but I think we would have to fix at least: the mbind() failure,
> and those disconnected VMA regions.
>
Yes, the mbind() failure is a bug.
regards
--
Tomas Vondra