Quick Links

Re: Adding basic NUMA awareness

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc:	Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject:	Re: Adding basic NUMA awareness
Date:	2025-11-11 11:52:01
Message-ID:	7b824e42-02de-4f4a-a81d-5acc89d417ea@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

here's a rebased patch series, fixing most of the smaller issues from
v20251101, and making cfbot happy (hopefully).

On 11/6/25 15:02, Jakub Wartak wrote:
> On Tue, Nov 4, 2025 at 10:21 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hi Tomas,
>
>>> 0007a: pg_buffercache_pgproc returns pgproc_ptr and fastpath_ptr in
>>> bigint and not hex? I've wanted to adjust that to TEXTOID, but instead
>>> I've thought it is going to be simpler to use to_hex() -- see 0009
>>> attached.
>>>
>>
>> I don't know. I added simply because it might be useful for development,
>> but we probably don't want to expose these pointers at all.
>>
>>> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
>>> called pg_shm_pgproc?
>>>
>>
>> Right. It does not belong to pg_buffercache at all, I just added it
>> there because I've been messing with that code already.
>
> Please keep them in for at least for some time (perhaps standalone
> patch marked as not intended to be commited would work?). I find the
> view extermely useful as it will allow us pinpointing local-vs-remote
> NUMA fetches (we need to know the addres).
>

Are you referring to the _pgproc view specifically, or also to the view
with buffer partitions? I don't intend to remove the view for shared
buffers, that's indeed useful.

>>> 0007c with check_numa='buffers,procs' throws 'mbind Invalid argument'
>>> during start:
>>>
>>> 2025-11-04 10:02:27.055 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d30400000 endptr 0x7f8d30800000
>>> num_procs 2523 node 0
>>> 2025-11-04 10:02:27.057 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d30800000 endptr 0x7f8d30c00000
>>> num_procs 2523 node 1
>>> 2025-11-04 10:02:27.059 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d30c00000 endptr 0x7f8d31000000
>>> num_procs 2523 node 2
>>> 2025-11-04 10:02:27.061 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d31000000 endptr 0x7f8d31400000
>>> num_procs 2523 node 3
>>> 2025-11-04 10:02:27.062 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d31400000 endptr 0x7f8d31407cb0
>>> num_procs 38 node -1
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>>
>>
>> I'll take a look, but I don't recall seeing such errors.
>>
>
> Alexy also reported this earlier, here
> https://www.postgresql.org/message-id/92e23c85-f646-4bab-b5e0-df30d8ddf4bd%40postgrespro.ru
> (just use HP, set some high max_connections). I've double checked this
> too , numa_tonode_memory() len needs to HP size.
>

OK, I'll investigate this.

>>> 0007d: so we probably need numa_warn()/numa_error() wrappers (this was
>>> initially part of NUMA observability patches but got removed during
>>> the course of action), I'm attaching 0008. With that you'll get
>>> something a little more up to our standards:
>>> 2025-11-04 10:27:07.140 CET [59696] DEBUG:
>>> fastpath_parititon_init node = 3, ptr = 0x7f4f4d400000, endptr =
>>> 0x7f4f4d4b1660
>>> 2025-11-04 10:27:07.140 CET [59696] WARNING: libnuma: ERROR: mbind
>>>
>>
>> Not sure.
>
> Any particular objections? We need to somehow emit them into the logs.
>

No idea, I think it'd be better to make sure this failure can't happen,
but maybe it's not possible. I don't understand the mbind failure well
enough.

>>> 0007f: The "mbind: Invalid argument"" issue itself with the below addition:
> [..]
>>>
>>> but mbind() was called for just 0x7f39eeab1660−0x7f39eea00000 =
>>> 0xB1660 = 726624 bytes, but if adjust blindly endptr in that
>>> fastpath_partition_init() to be "char *endptr = ptr + 2*1024*1024;"
>>> (HP) it doesn't complain anymore and I get success:
> [..]
>>
>> Hmm, so it seems like another hugepage-related issue. The mbind manpage
>> says this about "len":
>>
>> EINVAL An invalid value was specified for flags or mode; or addr + len
>> was less than addr; or addr is not a multiple of the system page size.
>>
>> I don't think that requires (addr+len) to be a multiple of page size,
>> but maybe that is required.
>
> I do think that 'system page size' means above HP page size, but this
> time it's just for fastpath_partition_init(), the earlier one seems to
> aligned fine (?? -- i havent really checked but there's no error)
>

Hmmm, ok. Will check. But maybe let's not focus too much on the PGPROC
partitioning, I don't think that's likely to go into 19.

>>> 0006d: I've got one SIGBUS during a call to select
>>> pg_buffercache_numa_pages(); and it looks like that memory accessed is
>>> simply not mapped? (bug)
>>>
>>> Program received signal SIGBUS, Bus error.
>>> pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
>>> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
>>> 386 pg_numa_touch_mem_if_required(ptr);
>>> (gdb) print ptr
>>> $1 = 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
>>> (gdb) where
>>> #0 pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
>>> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
>>> #1 0x0000561a672a0efe in ExecMakeFunctionResultSet
>>> (fcache=0x561a97e8e5d0, econtext=econtext(at)entry=0x561a97e8dab8,
>>> argContext=0x561a97ec62a0, isNull=0x561a97e8e578,
>>> isDone=isDone(at)entry=0x561a97e8e5c0) at
>>> ../src/backend/executor/execSRF.c:624
>>> [..]
>>>
>>> Postmaster had still attached shm (visible via smaps), and if you
>>> compare closely 0x7f4ed0200000 against sorted smaps:
>>>
>>> 7f4921400000-7f4b21400000 rw-s 252600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4b21400000-7f4d21400000 rw-s 452600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4d21400000-7f4f21400000 rw-s 652600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4f21400000-7f4f4bc00000 rw-s 852600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4f4bc00000-7f4f4c000000 rw-s 87ce00000 00:11 151111
>>> /anon_hugepage (deleted)
>>>
>>> it's NOT there at all (there's no mmap region starting with
>>> 0x"7f4e" ). It looks like because pg_buffercache_numa_pages() is not
>>> aware of this new mmaped() regions and instead does simple loop over
>>> all NBuffers with "for (char *ptr = startptr; ptr < endptr; ptr +=
>>> os_page_size)"?
>>>
>>
>> I'm confused. How could that mapping be missing? Was this with huge
>> pages / how many did you reserve on the nodes?
>
>
> OK I made and error and paritally got it correct (it crashes reliably)
> and partially mislead You, appologies, let me explain. There were two
> questions for me:
> a) why we make single mmap() and after numa_tonode_memory() we get
> plenty of mappings
> b) why we get SIGBUS (I've thought they are not continus, but they are
> after triple-checking)
>
> ad a) My testing shows that on HP,as stated initially ("all of this
> was on 4s/4 NUMA nodes with HP on"). That's what the codes does, you
> get single mmaps() (resulting in single entry in smaps), but afte
> noda_tonode_memory() there's many of them. Even on laptop:
>
> System has 1 NUMA nodes (0 to 0).
> Attempting to allocate 8.000000 MB of HugeTLB memory...
> Successfully allocated HugeTLB memory at 0x755828800000, smaps before:
> 755828800000-755829000000 rw-s 00000000 00:11 259808
> /anon_hugepage (deleted)
> Pinning first part (from 0x755828800000) to NUMA node 0...
> smaps after:
> 755828800000-755828c00000 rw-s 00000000 00:11 259808
> /anon_hugepage (deleted)
> 755828c00000-755829000000 rw-s 00400000 00:11 259808
> /anon_hugepage (deleted)
> Pinning second part (from 0x755828c00000) to NUMA node 0...
> smaps after:
> 755828800000-755828c00000 rw-s 00000000 00:11 259808
> /anon_hugepage (deleted)
> 755828c00000-755829000000 rw-s 00400000 00:11 259808
> /anon_hugepage (deleted)
>
> It gets even more funny, below I have 8MB HP=on, but just issue 2x
> numa_tonode_memory(for len 2MB on 4MB ptr to node0) (two times for
> ptr, second time in half of that):
>
> System has 1 NUMA nodes (0 to 0).
> Attempting to allocate 8.000000 MB of HugeTLB memory...
> Successfully allocated HugeTLB memory at 0x7302dda00000, smaps before:
> 7302dda00000-7302de200000 rw-s 00000000 00:11 284859
> /anon_hugepage (deleted)
> Pinning first part (from 0x7302dda00000) to NUMA node 0...
> smaps after:
> 7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859
> /anon_hugepage (deleted)
> 7302ddc00000-7302de200000 rw-s 00200000 00:11 284859
> /anon_hugepage (deleted)
> Pinning second part (from 0x7302dde00000) to NUMA node 0...
> smaps after:
> 7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859
> /anon_hugepage (deleted)
> 7302ddc00000-7302dde00000 rw-s 00200000 00:11 284859
> /anon_hugepage (deleted)
> 7302dde00000-7302de000000 rw-s 00400000 00:11 284859
> /anon_hugepage (deleted)
> 7302de000000-7302de200000 rw-s 00600000 00:11 284859
> /anon_hugepage (deleted)
>
> Why 4 instead of 1? Because some mappings are now "default" becauswe
> their policy was not altered:
>
> $ grep huge /proc/$(pidof testnumammapsplit)/numa_maps
> 7302dda00000 bind:0 file=/anon_hugepage\040(deleted) huge
> 7302ddc00000 default file=/anon_hugepage\040(deleted) huge
> 7302dde00000 bind:0 file=/anon_hugepage\040(deleted) huge
> 7302de000000 default file=/anon_hugepage\040(deleted) huge
>
> Back to originnal error, they are consecutive regions and earlier problem is
>
> error: 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
> start: 0x7f4921400000
> end: 0x7f4f4c000000
>
> so it fits into that range (that was my mistate earlier, using just
> grep not checking are they really within that), but...
>
>> Maybe there were not enough huge pages left on one of the nodes?
>
> ad b) right, something like that. I've investigated that SIGBUS there
> (it's going to be long):
>
> with shared_buffers=32GB, huge_pages 17715 (+1 from what postgres -C
> shared_memory_size_in_huge_pages returns), right after startup, but no
> touch:
>
> Program received signal SIGBUS, Bus error.
> pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> 386 pg_numa_touch_mem_if_required(ptr);
> (gdb) where
> #0 pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> #1 0x00005571f54ddb7d in ExecMakeTableFunctionResult
> (setexpr=0x557203870d40, econtext=0x557203870ba8,
> argContext=<optimized out>, expectedDesc=0x557203870f80,
> randomAccess=false) at ../src/backend/executor/execSRF.c:234
> [..]
> (gdb) print ptr
> $1 = 0x7f6cf8400000 <error: Cannot access memory at address 0x7f6cf8400000>
> (gdb)
>
>
> then it shows?! no available hugepage on one of the nodes (while gdb
> is hanging and preving autorestart):
>
> root(at)swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 299
> node1/meminfo:Node 1 HugePages_Free: 299
> node2/meminfo:Node 2 HugePages_Free: 299
> node3/meminfo:Node 3 HugePages_Free: 0
>
> but they are equal in terms of size:
> node0/meminfo:Node 0 HugePages_Total: 4429
> node1/meminfo:Node 1 HugePages_Total: 4429
> node2/meminfo:Node 2 HugePages_Total: 4429
> node3/meminfo:Node 3 HugePages_Total: 4428
>
> smaps shows that this address (7f6cf8400000) is mapped in this mapping:
> 7f6b49c00000-7f6d49c00000 rw-s 652600000 00:11 86064
> /anon_hugepage (deleted)
>
> numa_maps for this region shows this is this mapping on node3 (notice
> N3 + bind:3 matches lack of memory on Node 3 HugePAges_Free):
> 7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444
> N3=3444 kernelpagesize_kB=2048
>
> the surrounding area of this looks like that:
>
> 7f6549c00000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=4096
> N0=4096 kernelpagesize_kB=2048
> 7f6749c00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=4096
> N1=4096 kernelpagesize_kB=2048
> 7f6949c00000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=4096
> N2=4096 kernelpagesize_kB=2048
> 7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444
> N3=3444 kernelpagesize_kB=2048 <-- this is the one
> 7f6d49c00000 default file=/anon_hugepage\040(deleted) huge dirty=107
> mapmax=6 N3=107 kernelpagesize_kB=2048
>
> Notice it's just N3=3444, while the others are much larger. So
> something was using that hugepages memory on N3:
>
> # grep kernelpagesize_kB=2048 /proc/1679/numa_maps | grep -Po
> N[0-4]=[0-9]+ | sort
> N0=2
> N0=4096
> N1=2
> N1=4096
> N2=2
> N2=4096
> N3=1
> N3=1
> N3=1
> N3=1
> N3=107
> N3=13
> N3=3
> N3=3444
>
> So per above it's not there (at least not as 2MB HP). But the number
> of mappings is wild there! (node where it is failing has plenty of
> memory, no hugepage memory left, but it has like 40k+ of small
> mappings!)
>
> # grep -Po 'N[0-3]=' /proc/1679/numa_maps | sort | uniq -c
> 17 N0=
> 10 N1=
> 3 N2=
> 40434 N3=
>
> most of them are `anon_inode:[io_uring]` (and I had
> max_connections=10k). You may ask why in spite of Andres optimization
> for reducing number segments for uring, it's not working for me ? Well
> I've just noticed way too silent failure to active this (altough I'm
> on 6.14.x):
> 2025-11-06 13:34:49.128 CET [1658] DEBUG: can't use combined
> memory mapping for io_uring, kernel or liburing too old
> and I dont have io_uring_queue_init_mem()/HAVE_LIBURING_QUEUE_INIT_MEM
> apparently on liburing-2.3 (Debian's default). See [1] for more info
> (fix is not commited yet sadly).
>
> Next try, now with io_method = worker and right before start:
>
> root(at)swiatowid:/sys/devices/system/node# grep -r -i HugePages_Total
> node*/meminfo
> node0/meminfo:Node 0 HugePages_Total: 4429
> node1/meminfo:Node 1 HugePages_Total: 4429
> node2/meminfo:Node 2 HugePages_Total: 4429
> node3/meminfo:Node 3 HugePages_Total: 4428
> and HugePages_Free were 100% (if postgresql was down). After start
> (but without doing anything else):
> root(at)swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 4393
> node1/meminfo:Node 1 HugePages_Free: 4395
> node2/meminfo:Node 2 HugePages_Free: 4395
> node3/meminfo:Node 3 HugePages_Free: 3446
>
> So sadly the picture is the same (something stole my HP on N3 and it's
> PostgreSQL on it's own). After some time of investigating that ("who
> stole my hugepage across whole OS"), I've just added MAP_POPULATE to
> the mix of PG_MMAP_FLAGS and got this after start:
>
> root(at)swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 0
> node1/meminfo:Node 1 HugePages_Free: 0
> node2/meminfo:Node 2 HugePages_Free: 0
> node3/meminfo:Node 3 HugePages_Free: 1
>
> and then the SELECT to pg_buffercache_numa works fine(!).
>
> Another ways that I have found to eliminate that SIGBUS
> a. Would be to throw much more HugePages (so that node does not run to
> HugePages_Free), but that's not real option.
> b. Then I've reminded myself that I could be running custom kernel
> with experimental CONFIG_READ_ONLY_THP_FOR_FS (to reduce iTLB misses
> tranparently with specially linked PG; will double check exact stuff
> later), so I've thrown never into
> /sys/kernel/mm/transparent_hugepage/enabled and defrag too (yes ,
> disabled THP) and with that -- drumroll -- that SELECT works. The very
> same PG picture after startup (where earlier it would crash), now
> after SELECT it looks like that:
>
> root(at)swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 83
> node1/meminfo:Node 1 HugePages_Free: 0
> node2/meminfo:Node 2 HugePages_Free: 81
> node3/meminfo:Node 3 HugePages_Free: 82
>
> Hope that helps a little. To me it sounds like THP used that memory
> somehow and we've also wanted to use. With numa_interleave_ptr() that
> wouldn't be a problem because probably it would something else
> available, but not here as we indicated exact node.
>
>>> 0006e:
>>> I'm seeking confirmation, but is this the issue we have discussed
>>> on PgconfEU related to lack of detection of Mems_allowed, right? e.g.
>>> $ numactl --membind="0,1" --cpunodebind="0,1"
>>> /usr/pgsql19/bin/pg_ctl -D /path start
>>> still shows 4 NUMA nodes used. Current patches use
>>> numa_num_configured_nodes(), but it says 'This count includes any
>>> nodes that are currently DISABLED'. So I was wondering if I could help
>>> by migrating towards numa_num_task_nodes() / numa_get_mems_allowed()?
>>> It's the same as You wrote earlier to Alexy?
>>>
>>
>> If "mems_allowed" refers to nodes allowing memory allocation, then yes,
>> this would be one way to get into that issue. Oh, is this what happened
>> in 0006d?
>
> OK, thanks for confirmation. No, 0006d was about normal numactl run,
> without --membind.
>

I didn't have time to look into all this info about mappings, io_uring
yet, so no response from me.

>> I did get a couple of "operation canceled" failures, but only on fairly
>> old kernel versions (6.1 which came as default with the VM).
>
> OK, I'll try to see that later too.
>
> btw QQ regarding partitioned clockwise as I had thought: does this
> opens a road towards multiple BGwriters? (outside of this
> $thread/v1/PoC)
>

I don't think the clocksweep partitioning is required for multiple
bgwriters, but it might make it easier.

regards

--
Tomas Vondra

Attachment	Content-Type	Size
v20251111-0007-NUMA-partition-PGPROC.patch	text/x-patch	49.2 KB
v20251111-0006-NUMA-shared-buffers-partitioning.patch	text/x-patch	44.0 KB
v20251111-0005-clock-sweep-weighted-balancing.patch	text/x-patch	5.2 KB
v20251111-0004-clock-sweep-scan-all-partitions.patch	text/x-patch	6.7 KB
v20251111-0003-clock-sweep-balancing-of-allocations.patch	text/x-patch	25.3 KB
v20251111-0002-clock-sweep-basic-partitioning.patch	text/x-patch	33.9 KB
v20251111-0001-Infrastructure-for-partitioning-shared-buf.patch	text/x-patch	14.9 KB

In response to

Re: Adding basic NUMA awareness at 2025-11-06 14:02:57 from Jakub Wartak

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Matthias van de Meent	2025-11-11 12:06:30	Re: [WiP] B-tree page merge during vacuum to reduce index bloat
Previous Message	Álvaro Herrera	2025-11-11 11:41:25	CURL_IGNORE_DEPRECATION