
Re: Adding basic NUMA awareness

From:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Adding basic NUMA awareness
Date:	2026-06-17 12:13:03
Message-ID:	CAKZiRmzumFJ8+Bqz5mfX+QGkrKEaH32zz7kPgEShHWXE4QzyQw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jun 16, 2026 at 2:39 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>
>
> On 6/16/26 10:16, Jakub Wartak wrote:
> > On Fri, Jun 5, 2026 at 2:52 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >> Hi,
> >
> > Hi Tomas, thanks for working on this.
> >
> >> Here's an updated version of the NUMA patch series, based on some recent
> >> discussions about this (some at pgconf.dev, but not only that),
> > [..]
> >
> > 1. 005 says:
> >
> > + * XXX We should enforce this in bufmgr.c, when initializing the partitions.
> > + */
> > +#define MAX_BUFFER_PARTITIONS 32
> >
> > but there isn't any direct check that pg_numa_get_max_node() ->
> > numa_max_node() does not get higher than allowed here. In theory this could
> > happen, I think, if ClockSweepPartitionIndex() returned
> > numa = numa_node_of_cpu()
> > on some hypothetical very high-end setup (with plenty of sub-NUMA nodes),
> > and that would cause accessing .balance[] out of bounds.
> >
>
> Yes, this should be capped to the MAX_BUFFER_PARTITIONS.
>
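
A minimal sketch of such a cap, assuming the node number comes from something
like numa_node_of_cpu(). clamp_partition_index() is a hypothetical helper, not
code from the patch; only MAX_BUFFER_PARTITIONS (32) is taken from 005. It
folds out-of-range nodes back into the valid range, which is one possible
policy (capping the number of partitions at init time would be another):

```c
#include <assert.h>

#define MAX_BUFFER_PARTITIONS 32    /* value taken from the 005 patch */

/*
 * Hypothetical helper: map a NUMA node number to a valid index into an
 * array of MAX_BUFFER_PARTITIONS entries (such as .balance[]).
 * Negative values (lookup failed) fall back to partition 0; nodes beyond
 * the limit wrap around instead of indexing out of bounds.
 */
static int
clamp_partition_index(int node)
{
    int         index;

    if (node < 0)
        return 0;

    index = node % MAX_BUFFER_PARTITIONS;
    assert(index >= 0 && index < MAX_BUFFER_PARTITIONS);
    return index;
}
```

With this, node 5 stays 5, while a hypothetical node 40 would map to index 8.
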
> > 2. If we have in 004 struct ClockSweep with nextVictimBuffer, shouldn't
> > this be padded/aligned somehow later in BufferStrategyControl which does
> > ClockSweep sweeps[FLEXIBLE_ARRAY_MEMBER];
> > to avoid contention/false sharing? (comments says it should be but it
> > doesn't seem so?), maybe the comment should be TODO for now? I have not
> > quantified any potential benefit
> >
> > With pahole after some hassle I've got:
> > struct ClockSweep {
> > [..]
> > pg_atomic_uint32 nextVictimBuffer; /* 16 4 */
> > [..]
> > /* size: 80, cachelines: 2, members: 11 */
> > /* sum members: 77, holes: 1, sum holes: 3 */
> > /* last cacheline: 16 bytes */
> > };
> > maybe with smaller MAX_BUFFER_PARTITIONS we could pack this into size=64 ?
> >
>
> Possibly. I'm not entirely happy with making the ClockSweep struct so
> much larger, but I haven't found a better way to store the counters
> needed for balancing. The only thing I can think of is storing it
> outside the struct, and maybe that's the right thing to do.
>
> But that assumes the current balancing approach is the right one.

Yeah, I'm just not sure if there is some wasted performance due to
false-sharing in very heavy benchmark scenarios (in theory the
nextVictimBuffer could bounce rather heavily).
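
To illustrate what I mean by padding, here is a rough sketch using the same
union trick as BufferDescPadded. ClockSweepSketch is a made-up stand-in sized
to 80 bytes like the pahole output above (the real fields differ), and
CACHE_LINE_SIZE = 64 is an assumption for this box (in-tree code would more
likely use PG_CACHE_LINE_SIZE):

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64      /* assumption: typical x86_64 line size */

/* Simplified stand-in for the patch's ClockSweep; the real fields differ. */
typedef struct ClockSweepSketch
{
    uint32_t    firstBuffer;
    uint32_t    numBuffers;
    uint32_t    nextVictimBuffer;   /* the hot, frequently updated counter */
    uint32_t    balance[17];        /* brings the struct to 80 bytes */
} ClockSweepSketch;

/*
 * Pad each array element up to a whole number of cache lines, so that two
 * partitions never share a line.
 */
typedef union ClockSweepPadded
{
    ClockSweepSketch sweep;
    char        pad[(sizeof(ClockSweepSketch) + CACHE_LINE_SIZE - 1)
                    / CACHE_LINE_SIZE * CACHE_LINE_SIZE];
} ClockSweepPadded;

static_assert(sizeof(ClockSweepPadded) % CACHE_LINE_SIZE == 0,
              "padded element must span whole cache lines");
```

This only helps if the start of the sweeps[] array is itself cache-line
aligned, and it costs 128 bytes per partition instead of 80 in this sketch.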

> > 3. In 004 sched_getcpu() is used and mentioned how to check if it is available
> >
> > My $0.02 (maybe not that important): I've seen at least once (on EC2?) a
> > clock_gettime() that was very slow, and that was because it was not
> > available in the vDSO. It's usually some mix of kernel <-> arch <-> libc (not
> > always glibc?) compatibility matrix issue. My worry is that StrategyGetBuffer()
> > -> ChooseClockSweep() -> ClockSweepPartitionIndex() -> sched_getcpu() would be
> > available, but slow, and it would mean paying the real syscall price (and not
> > just once, but per buffer). I'm also thinking of other platforms (FreeBSD comes to
> > mind, but I haven't checked further). The point is: wouldn't it be cheaper
> > to refresh it from time to time instead? Otherwise we risk some slow
> > code on non-x86_64, though I doubt how widespread e.g. ARM64 with NUMA is..
> > Or alternative is to have pg_test_numa proggie and this would be measuring
> > certain things about NUMA including timing of sched_getcpu (just like
> > pg_test_timing does for time), at least that could explain why somebody's
> > system/platform is slow.
> >
>
> Yes, I think we may need some sort of caching for this / check only
> sometimes. I haven't seen it to matter, but that may be luck and on
> other systems / platforms it may be worse.

Okay, so maybe an action point for us much later would be to try this on a
2+ socket ARM box or some other much rarer setup, just to see if we need to do
it at all (perhaps annotate it with a TODO, I have short memory ;))
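
For the record, the "check only sometimes" idea could look roughly like the
sketch below. Everything here is hypothetical (names, the refresh interval of
1024, the struct); the lookup is passed in as a function pointer so that on
Linux it could be sched_getcpu(), while fake_cpu_lookup() is only a counting
stand-in for demonstration:

```c
#include <assert.h>
#include <stddef.h>

#define CPU_REFRESH_INTERVAL 1024   /* hypothetical: answers served per lookup */

typedef int (*cpu_lookup_fn) (void);

typedef struct CachedCpu
{
    cpu_lookup_fn lookup;       /* e.g. sched_getcpu on Linux */
    int         cpu;            /* last value returned by lookup() */
    unsigned    remaining;      /* cached answers left before refreshing */
} CachedCpu;

static void
cached_cpu_init(CachedCpu *c, cpu_lookup_fn lookup)
{
    assert(c != NULL && lookup != NULL);
    c->lookup = lookup;
    c->cpu = -1;
    c->remaining = 0;           /* force a real lookup on first use */
}

/* Return the cached CPU, calling lookup() only once per interval. */
static int
cached_cpu_get(CachedCpu *c)
{
    if (c->remaining == 0)
    {
        c->cpu = c->lookup();
        c->remaining = CPU_REFRESH_INTERVAL;
    }
    c->remaining--;
    return c->cpu;
}

/* Stand-in lookup that counts its invocations, for demonstration only. */
static int  fake_lookup_calls = 0;

static int
fake_cpu_lookup(void)
{
    fake_lookup_calls++;
    return 3;
}
```

The obvious trade-off is that a backend migrated to another node keeps using
the stale answer for up to CPU_REFRESH_INTERVAL buffer allocations.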

> > 4. The patch has a problem (without the fix for #8): when the number of
> > available huge pages in the OS is much higher than
> > shared_memory_size_in_huge_pages, it will use only the first NUMA node. This
> > might be a problem when starting multiple DBs (they will occupy the first
> > available NUMA node):
> >
> > ### with s_b=8GB and nr_hugepages=1500 it's OK
> >
> > # find /sys/devices/system/node/ -name nr_hugepages -exec grep -H . {} \; | grep 2048 | sort
> > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1250
> > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1250
> > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:1250
> > /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:1250
> >
> > ## note the correct split below for N0/N1..
> > # grep huge /proc/`pgrep -f /usr/pgsql19/bin/postgres`/numa_maps
> > 7fb1b4400000 default file=/anon_hugepage\040(deleted) huge dirty=4269
> > mapmax=6 N0=1250 N1=1250 N2=519 N3=1250 kernelpagesize_kB=2048
> >
> > ### still s_b=8GB but nr_hugepages = 19000 (~37GB), it ends all on N0=4269
> > # find /sys/devices/system/node/ -name nr_hugepages -exec grep -H . {} \; | grep 2048 | sort
> > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:4750
> > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:4750
> > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:4750
> > /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:4750
> > ## all on N0...
> > # grep huge /proc/`pgrep -f /usr/pgsql19/bin/postgres`/numa_maps
> > 7ff3a7a00000 default file=/anon_hugepage\040(deleted) huge dirty=4269
> > mapmax=6 N0=4269 kernelpagesize_kB=2048
> >
> > I was even thinking of going to the length of adding code, at some later
> > date, that inspects /sys to verify that the kernel NUMA hugepages are really
> > distributed across the nodes as they should be (it's easy to end up on just 1
> > node out of many; allocating via sysctl -w <higher> and then <lower> is an easy
> > way to force hugepages onto just 1 node instead of many :o). I've hit the
> > problem multiple times, so we should bail out if we want NUMA and the
> > Buffer Blocks were put on just 1 node (instead of many).
> >
>
> How come the pg_numa_bind_to_node() calls don't move the parts to the
> correct node?

Well, if we are not checking the mbind() error code, it could behave in a
non-deterministic way, I think (you ask for MPOL_BIND, maybe it does something,
maybe it doesn't work, but we continue anyway).

> If something is already using huge pages on the other nodes, then sure,
> it will fail. But I think that's OK - it's a best-effort thing. Maybe we
> should exit instead in this case?

Yes, I think we should exit with FATAL.

> > 5. In 005 we could mention more clearly what's the difference between
> > those 3: numRequestedAllocs, numTotalAllocs, numTotalRequestedAllocs
> > in the definition to make it easier to read, maybe copy those earlier
> > descriptions there too, as we already have:
> >
> > + * The balancing happens in intervals - it adjusts future allocations
> > + * based on stats about recent allocations, namely:
> > + *
> > + * - numBufferAllocs - number of allocations served by a partition
> > + *
> > + * - numRequestedAllocs - number of allocatios requested in a partition
> >
>
> I agree the explanation for this is not entirely clear.
>
> > 6. While at it, it would be helpful if we could reset the
> > pg_buffercache_partitions stats in some way (pg_buffercache_partitions
> > is very useful).. or is there a way to plug into the main pg_reset functions?
> >
>
> Hmm, I was afraid it'd interfere with the balancing. I'm not sure it
> makes sense to reset just some of the fields - it'd make it much harder
> to interpret the counters. I'll think about this.

Well, small hint: I find the view very, very useful in explaining
where/why memory is being allocated from. If we want to include it
in the final version, then the reset might be worth doing (later on); if it
won't be included, I'm fine with just doing \watch 1 in psql to see what's
happening. Another hint is that maybe it doesn't belong in pg_buffercache,
and moving it would make it easier to integrate into pg_stat_reset*() -- dunno
how/if an extension can plug into it.

> > 7. If I add basic error checking for mbind() then it complains a lot, like
> > below with annotated strace -ffe mbind to show the point:
> >[..]
>
> I think this is a silly bug on my side, clearly the nodemask should have
> size > 0 even for node 0.
>
> > it doesn't report errors anymore and surprisingly hugepages in numa_maps are
> > altered from:
> > 7fb1b4400000 default file=/anon_hugepage\040(deleted) huge dirty=4269
> > mapmax=6 N0=1250 N1=1250 N2=519 N3=1250 kernelpagesize_kB=2048
> >
> > to explicit "binds":
> > 7f8540000000 default file=/anon_hugepage\040(deleted) huge dirty=25
> > mapmax=6 N0=25 kernelpagesize_kB=2048
> > 7f8543200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=7
> > mapmax=3 N0=7 kernelpagesize_kB=2048
> > 7f8544000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > mapmax=2 N0=1 kernelpagesize_kB=2048
> > 7f8544200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=7
> > mapmax=2 N1=7 kernelpagesize_kB=2048
> > 7f8545000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > mapmax=2 N0=1 kernelpagesize_kB=2048
> > 7f8545200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=7
> > mapmax=2 N2=7 kernelpagesize_kB=2048
> > 7f8546000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > mapmax=2 N0=1 kernelpagesize_kB=2048
> > 7f8546200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=7
> > mapmax=2 N3=7 kernelpagesize_kB=2048
> > 7f8547000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > mapmax=3 N0=1 kernelpagesize_kB=2048
> > 7f8547200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=1023
> > mapmax=2 N0=1023 kernelpagesize_kB=2048
> > 7f85c7000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > N0=1 kernelpagesize_kB=2048
> > 7f85c7200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1023
> > N1=1023 kernelpagesize_kB=2048
> > 7f8647000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > N1=1 kernelpagesize_kB=2048
> > 7f8647200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=1023
> > N2=1023 kernelpagesize_kB=2048
> > 7f86c7000000 default file=/anon_hugepage\040(deleted) huge dirty=1
> > N3=1 kernelpagesize_kB=2048
> > 7f86c7200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=1023
> > N3=1023 kernelpagesize_kB=2048
> > 7f8747000000 default file=/anon_hugepage\040(deleted) huge dirty=117
> > mapmax=6 N2=117 kernelpagesize_kB=2048
> >
> > so lots of VMAs were created (it could affect performance in some way, I think
> > for sure it would affect for worse fork() rates by postmaster for new conns).
> >
>
> Isn't this 4 VMAs for buffers and 4 VMAs for buffer descriptors? Or maybe
> it's the parts that could not be mapped due to insufficient alignment?

Yes, with the patchset I'm getting WARNINGs like:

WARNING: buffers for node 0 not well aligned
[0x7efc100f5000,0x7efc900f5000] aligned
[0x7efc10200000,0x7efc90000000]
WARNING: buffers descriptors for node 0 not well aligned
[0x7efc0c0f4f80,0x7efc0d0f4f41] aligned
[0x7efc0c200000,0x7efc0d000000]
WARNING: buffers for node 1 not well aligned
[0x7efc900f5000,0x7efd100f5000] aligned
[0x7efc90200000,0x7efd10000000]
WARNING: buffers descriptors for node 1 not well aligned
[0x7efc0d0f4f80,0x7efc0e0f4f41] aligned
[0x7efc0d200000,0x7efc0e000000]
WARNING: buffers for node 2 not well aligned
[0x7efd100f5000,0x7efd900f5000] aligned
[0x7efd10200000,0x7efd90000000]
WARNING: buffers descriptors for node 2 not well aligned
[0x7efc0e0f4f80,0x7efc0f0f4f41] aligned
[0x7efc0e200000,0x7efc0f000000]
WARNING: buffers for node 3 not well aligned
[0x7efd900f5000,0x7efe100f5000] aligned
[0x7efd90200000,0x7efe10000000]
WARNING: buffers descriptors for node 3 not well aligned
[0x7efc0f0f4f80,0x7efc100f4f41] aligned
[0x7efc0f200000,0x7efc10000000]
LOG: starting PostgreSQL 19beta1 on x86_64-linux, compiled by
gcc-12.2.0, 64-bit

relevant numa_maps for postmaster:

7efc0c200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=7
mapmax=2 N0=7 kernelpagesize_kB=2048
7efc0d000000 default file=/anon_hugepage\040(deleted) huge dirty=1
mapmax=2 N3=1 kernelpagesize_kB=2048
7efc0d200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=7
mapmax=2 N1=7 kernelpagesize_kB=2048
7efc0e000000 default file=/anon_hugepage\040(deleted) huge dirty=1
mapmax=2 N3=1 kernelpagesize_kB=2048
7efc0e200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=7
mapmax=2 N2=7 kernelpagesize_kB=2048
7efc0f000000 default file=/anon_hugepage\040(deleted) huge dirty=1
mapmax=2 N3=1 kernelpagesize_kB=2048
7efc0f200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=7
mapmax=2 N3=7 kernelpagesize_kB=2048
7efc10000000 default file=/anon_hugepage\040(deleted) huge dirty=1
mapmax=3 N3=1 kernelpagesize_kB=2048
7efc10200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=1023
N0=1023 kernelpagesize_kB=2048
7efc90000000 default file=/anon_hugepage\040(deleted) huge dirty=1
N3=1 kernelpagesize_kB=2048
7efc90200000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1023
N1=1023 kernelpagesize_kB=2048
7efd10000000 default file=/anon_hugepage\040(deleted) huge dirty=1
N0=1 kernelpagesize_kB=2048
7efd10200000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=1023
N2=1023 kernelpagesize_kB=2048
7efd90000000 default file=/anon_hugepage\040(deleted) huge dirty=1
N2=1 kernelpagesize_kB=2048
7efd90200000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=1023
N3=1023 kernelpagesize_kB=2048
7efe10000000 default file=/anon_hugepage\040(deleted) huge dirty=116
mapmax=6 N1=116 kernelpagesize_kB=2048

so if you take just 7efc10200000..7efc90000000 (from buffers@node0, 1st line)
and zoom in/grep, you'll get:

7efc10200000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=1023
N0=1023 kernelpagesize_kB=2048
7efc90000000 default file=/anon_hugepage\040(deleted) huge dirty=1
N3=1 kernelpagesize_kB=2048

so at 7efc90000000 there's a single hugepage on node 3 (N3=1), i.e. the
binding is off by one 2048kB page.

That page contains the tail of the node 0 buffers and the head of the node 1
buffers, and it is excluded from both mbind() calls. At first I thought I had
spotted the cause in 003/BufferPartitionsInit(), and tried replacing:
cstartptr = (char *) &BufferDescriptors[buff_first];
endptr = (char *) &BufferDescriptors[buff_last] + 1; // <-- BUG?
with
endptr = (char *) &BufferDescriptors[buff_last + 1];

but I got the same issue, so I forced mbind() to use TYPEALIGN_DOWN for both
pointers, so that the policy covers everything, and got:

7fc134e00000 default file=/anon_hugepage\040(deleted) huge dirty=24
mapmax=6 N3=24 kernelpagesize_kB=2048
7fc137e00000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=8
mapmax=4 N0=8 kernelpagesize_kB=2048
7fc138e00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=8
mapmax=2 N1=8 kernelpagesize_kB=2048
7fc139e00000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=8
mapmax=3 N2=8 kernelpagesize_kB=2048
7fc13ae00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=8
mapmax=2 N3=8 kernelpagesize_kB=2048
7fc13be00000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=1024
N0=1024 kernelpagesize_kB=2048
7fc1bbe00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1024
N1=1024 kernelpagesize_kB=2048
7fc23be00000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=1024
mapmax=2 N2=1024 kernelpagesize_kB=2048
7fc2bbe00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=1024
N3=1024 kernelpagesize_kB=2048
7fc33be00000 default file=/anon_hugepage\040(deleted) huge dirty=116
mapmax=6 N1=116 kernelpagesize_kB=2048

that was with:
@@ -351,7 +351,7 @@ BufferPartitionsInit(void)
}

/* best effort: align the pointers, so that the mbind() works */
- startptr = (char *) TYPEALIGN(numa_page_size, startptr);
+ startptr = (char *) TYPEALIGN_DOWN(numa_page_size, startptr);
endptr = (char *) TYPEALIGN_DOWN(numa_page_size, endptr);

@@ -374,7 +374,7 @@ BufferPartitionsInit(void)
}

Looks way better, but it was a wild guess (TBH I think I preferred the
previous patchset version, where the input shm addresses were already aligned,
but if you and others say this is the easier route, then sure).
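
To spell out why the two alignment choices behave differently, here is a small
standalone sketch. The helpers are my own stand-ins for TYPEALIGN /
TYPEALIGN_DOWN working on plain integers, unbound_gap() is hypothetical, and
the 2MB page size and the addresses come from the WARNING output above:

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE ((uint64_t) 2 * 1024 * 1024)     /* 2MB huge pages */

/* Round addr down to a multiple of align (align must be a power of two). */
static uint64_t
align_down(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Round addr up to a multiple of align (align must be a power of two). */
static uint64_t
align_up(uint64_t addr, uint64_t align)
{
    return align_down(addr + align - 1, align);
}

/*
 * Bytes left without a bind policy between two adjacent partitions that
 * meet at the unaligned address "boundary". The previous partition's end
 * is always rounded down; the next partition's start is rounded up or
 * down depending on start_rounds_up.
 */
static uint64_t
unbound_gap(uint64_t boundary, int start_rounds_up)
{
    uint64_t    prev_end = align_down(boundary, HUGE_PAGE_SIZE);
    uint64_t    next_start = start_rounds_up
        ? align_up(boundary, HUGE_PAGE_SIZE)
        : align_down(boundary, HUGE_PAGE_SIZE);

    return next_start - prev_end;
}
```

For the node 0 / node 1 boundary at 0x7efc900f5000, rounding the start up
leaves exactly one 2MB page (at 0x7efc90000000) without a policy, which matches
the stray "default" VMA above; rounding both down leaves no gap. The price is
that the shared page then belongs entirely to the next partition, so the tail
of the previous node's buffers lands on the next node.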

> > To me it looks like there's plenty of "N[0..3]=1" with "default" indicating
> > single HP page being somehow left/missed in address calculations, but I
> > haven't pressed this harder.
> >
> > NOTE: the patch works even without this fix, but I believe if we got non-0 we
> > cannot reliably trust that the optimized memory layout has been deployed (I
> > suspect it's some kind of luck that it sharded the shm based on the number of
> > hugepages available)
> >
>
> I'm not sure I understand what you mean. What non-0?

Non-0 return code from mbind(): the mbind() call failed with -1, but it
probably(?) still altered the NUMA page placement somehow, just without
creating a separate "isolated" VMA with an explicit "bind" policy. Or maybe it
did not work at all..

> >
> >> questions
> >> ---------
> >>
> >> At this point, my main question is whether there's a better way to
> >> partition clock-sweep and/or do the balancing of allocations between
> >> partitions. I believe it does work, but I have a feeling there might be
> >> a more elegant way to do this kind of stuff (like an established
> >> balancing algorithm of some sort).
> >
> >
> > 8. The crux of this email and the stuff I wanted to discuss further: when the
> > server is started on this 4-NUMA box with
> > * numactl --cpunodebind=0 pg_ctl start # so that all fork()ed backends will be
> > on node#0
> > * the shm split onto 4 nodes properly
> > * s_b still just 8GB (with ideal split),
> > * DB size ~15GB with 8 pgbench partitions (and fully in VFS cache)
> > * pgbench -c 8 -j 8 postgres -T 20 -P 1 -f seqconcurrscans.pgb with:
> > \set num (:client_id % 8) + 1
> > select sum(octet_length(filler)) from pgbench_accounts_:num;
> > * mpstat correctly reports just node#0 being used
> >
> > a. with the patch for GUCs with numa on and defaults two clocksweep settings
> > on, I'm getting:
> >
> > latency average = 3252.254 ms
> > latency stddev = 72.011 ms
> >
> > b. with debug_clocksweep_balance=off, I'm reliably getting
> >
> > latency average = 2688.742 ms
> > latency stddev = 61.738 ms
> >
> > so IMHO clocksweep partitioning is cool, but, since we are discussing it, the
> > current balancing logic leaves some juice on the table compared to the most
> > optimized variant (~1.2x) with ~90ns:270ns (local vs remote latency). In the
> > scenario above it was 8 backends accessing 8x 1.6GB tables (lower than
> > NBuffers / 4).
> >
>
> How does this compare to master, i.e. without these NUMA patches?

I had some problems comparing, but with "perfect setup" that includes the
following:
- pgbench/clients on node#0
- backends running on node#0
- hugepages memory on node#0..3, but with this patchset and those goodies:
debug_clocksweep_balance=on
debug_clocksweep_balance_recalc=off
debug_clocksweep_scan_all_partitions=on
(so technically node0 backends accessing just Buffers / Buffers Desc from
node0, technically node0 weights: "{100,0,0,0}")
- today's new discovery for me: the Linux kernel VFS cache also
has a first-touch (!) NUMA policy, which is really important here (so
VFS cached data also alters the test results wildly!), so I had to
force unloading the VFS cache and force-loading it into node#0. With that I
was getting for seqconcurrscans:
latency average = 2701.705 ms
latency stddev = 111.608 ms

vs master, huh, but which scenario? the default one without any affinity?
- assuming you get the shm split like "N0=1059 N1=1299 N2=879 N3=1031",
but that's append-only (so not interleaved (?)) and you even risk
having shm placed on just __one__ node (if it is big enough and free
enough)
- if you go straight to benchmarking it (CPU hits random nodes)
latency average = 3439.200 ms
latency stddev = 580.501 ms
- with backends forked() to node#0 (numactl --cpunodebind=0 pg_ctl start)
latency average = 4937.543 ms
latency stddev = 573.841 ms
because of random VFS cache placement (I imagine it as flow of on node0 CPUs
a. nextVictimBuffer contention
b. getBuffer() - fetch random shm memory from random NUMA node
c. pread() - fetch from VFS cache but from *remote* NUMA node
)
- with backends forked() to node#0 and pinned VFS cached fully on node#0 too
latency average = 4518.651 ms
latency stddev = 797.369 ms
(but this is still Buffers from other nodes)
- same as above and numactl interleaved shm:
latency average = 3792.016 ms
latency stddev = 825.186 ms
- same as above and interleaved shm, but without pinning to CPUs on a specific
node, and ensuring random VFS cache placement across nodes:
latency average = 2913.813 ms
latency stddev = 352.552 ms
- but the moment anything reads base/ into the VFS cache on a particular
node (imagine pg_prewarm or even just tar), assuming it was not there before,
it also pins that data to that node's memory and you'll get:
latency average = 3594.180 ms
latency stddev = 851.949 ms

> > Dunno if it should be optimized further, certainly we'll get reports from
> > quick benchmarks run by people that PG 20 could be *slower* because.. well,
> > they got (sub)optimal layout during startup (all HP on 1 node and some
> > query hitting just that one query with local affinity and this is visible
> > to naked eye). I was re-reading thread and Andres also wrote "We should use
> > the partitioned clock sweep to default to using local memory as long as
> > possible."
> >
> > So two further ideas:
> >
> > I. BufferAccessStrategy: we could derive affinity from the BAS strategy
> > itself, couldn't we? If we are using a capped ring buffer, we could indicate
> > that we want it just from the local node as a priority, disregarding weights (?).
> > Same goes for BAS_VACUUM (why would one want it on a remote NUMA node?). With
> > BAS_BULKREAD there would be some potential issue with the sync-scan-table code
> > though. With BAS_BULKWRITE, e.g. CTAS/CREATE INDEX, it makes a lot of sense too.
> > prewarm could be hacked to use some new special BAS_DISTRIBUTE or something
> > for ideal distribution amongst all NUMA nodes.
> >
>
> Yes, I think it might make sense to disable balancing in these cases.

OK, I did not code anything of that as

> > II. what if we could track if the relation is just all-local-access?
> >
> > Another idea is that if we knew that it's just us working on some
> > relation (created by us; or it's not being touched remotely) then we could
> > ask for local-only memory affinity. So something like this:
> >
> > a. in case of locally-only accessed rels =>
> > ask for local memory first
> > if that fails, fall back to weighted RR (so according to weights)
> > b. in case of other rels => weighted RR (so according to weights) directly
> >
> > The tracking of whether a Buffer was accessed just locally or remotely
> > is not hard to imagine (by using some free "bits" in BufferDesc.
> > "state" where refcount/usagecount are stored, well at least 4 bits
> > for my 4 nodes, but there's plenty left there). I have some PoC for
> > that, but that's just per-Buffer tracking of "was this Buffer accessed by
> > remote nodes", and I'm completely lost on how to make the transition to the
> > is-the-relation-being-accessed-across-NUMA-nodes info to drive such an
> > optimization (we would need some shared infra just for tracking such info;
> > assuming up to 2^31 or 2^32 relations [OID?] and using at least
> > 2..4 bits each, that's already a huge number: we are talking GBs of shm mem).
> >
> > BTW: I've been experimenting with this patchset and added a couple of things
> > (see attached), and with them I'm able to get optimal results just by forcing
> > affinity too, using that earlier bench:
> >
> > latency average = 2512.929 ms
> > latency stddev = 97.525 ms
> >
> > and that was with pure 100% affinity to my local node:
> >
> > select pg_buffercache_set_partition(0, '{100,0,0,0}');
> > debug_clocksweep_balance_recalc=off
> > debug_clocksweep_balance=on
> > debug_clocksweep_scan_all_partitions=on
> >
> > (so it's another proof that the code is fine, it's just the algorithm that would
> > have to be adjusted)
> >
> > For benchmarks with pgbench -S for 100% local affinity vs 100% remote
> > (I can do that with that pg_buffercache_set_partition() of mine), I'm
> > getting just +/- 1-2k TPS (42-43k TPS vs 41-42k TPS), so not much, __but__
> > I've spotted another bug, where we are fetching memory from
> > suboptimal places if we are not on node#0; I'll need to dig into that more
> > though. Another thing is that pgbench -S runs are much less demanding in
> > terms of memory bandwidth used (<1GB/s here vs 6-8GB/s for
> > seqconcurrscans.sql using the same amount of cores).
> >
>
> No opinion. I need to look at this closer.

Great !

> >> The other thing I need to verify is how this behaves with
> >> kernel.nr_hugepages. With some earlier versions it was easy to end in a
> >> situation where everything seemed to work, but then much later the
> >> kernel realized it does not have enough huge pages on a particular NUMA
> >> node and crashed with a segfault (or was it sigbus?).
> >
> > It was SIGBUS and with this patchset I think we are fine: I have never
> > witnessed this one crashing with SIGBUS.
> >
>
> Good. But I wonder if allocating just the precise number of huge pages
> (per shared_memory_size_in_huge_pages) can prevent moving the partitions
> to the correct node.
>

Not sure I understand (?) how's that related to SIGBUS?

-J.

In response to

Re: Adding basic NUMA awareness at 2026-06-16 12:39:45 from Tomas Vondra
