From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Adding basic NUMA awareness
Date: 2025-10-13 18:34:39
Message-ID: 92e23c85-f646-4bab-b5e0-df30d8ddf4bd@postgrespro.ru
Lists: pgsql-hackers
On 10/13/25 14:09, Tomas Vondra wrote:
> I'm not sure I understand. Are you suggesting there's a bug in the
> patch, the kernel, or somewhere else?
We need to ensure that both addr and (addr + size) are aligned to the
page size of the target mapping when invoking 'numa_tonode_memory';
otherwise it may produce unexpected results.
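A minimal sketch of the invariant, assuming 'huge_page_size' holds the
page size backing the mapping (Assert is PostgreSQL's standard macro
from c.h):

/*
 * mbind() underneath numa_tonode_memory() rejects ranges whose start
 * address or length is not a multiple of the mapping's page size, so
 * check both before setting the policy.
 */
Assert((uintptr_t) addr % huge_page_size == 0);
Assert(size % huge_page_size == 0);
numa_tonode_memory(addr, size, node);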
> But this is exactly why (with hugepages) the code aligns everything
> to huge page boundary, and sizes everything as a multiple of huge page.
> At least I think so. Maybe I remember wrong?
I believe there are places in the current patch that can perform such
unaligned mappings; see below for examples.
> Can you actually demonstrate this?
This issue is related to the calculation of partition sizes for buffer
descriptors when there are multiple partitions per node. Currently we
ensure that each node gets a number of buffers that fits into whole
memory pages, but if there are several partitions per node, there is no
guarantee that the partition size will be properly aligned for the
descriptors. The problem shows up only with multiple partitions per
node, and with MIN_BUFFER_PARTITIONS equal to 4 it can affect only
configurations with 2 or 3 nodes.
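One way to avoid this, sketched below purely as an illustration
(descs_per_page and buffers_per_partition are hypothetical names; the
64-byte padded descriptor size matches the logged numbers), would be to
round each partition's buffer count to a whole number of descriptor
pages, with the last partition absorbing the remainder:

/*
 * Hypothetical sketch: descs_per_page is the number of padded buffer
 * descriptors that fit into one page of the descriptor mapping; keeping
 * each partition's buffer count a multiple of it keeps every descriptor
 * partition boundary page-aligned.
 */
size_t descs_per_page = numa_page_size / sizeof(BufferDescPadded);
buffers_per_partition = ((buffers_per_partition + descs_per_page - 1) /
                         descs_per_page) * descs_per_page;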
Two examples here. First, let's assume we want shared_buffers set to
32GB with 3 NUMA nodes and 2MB pages. NBuffers will be 4,194,304,
min_node_buffers will be 32,768 and num_partitions_per_node will be 2
(so 6 partitions in total). NBuffers/min_node_buffers = 128, so the
nearest multiple of min_node_buffers which allows us to cover all
buffers with 3 nodes is 43 (42*3 = 126, 43*3 = 129). num_buffers_per_node
is thus 43*min_node_buffers, which is aligned to the page size, but we
need to split it between two partitions, so each gets
21.5*min_node_buffers buffers. This still allows us to split the buffers
themselves on a page boundary, but the descriptor partitions get split
right in the middle of a page. Here is the log for such a configuration:
NUMA: buffers 4194304 partitions 6 num_nodes 3 per_node 2 buffers_per_node 1409024 (min 32768)
NUMA: buffer 0 node 0 partition 0 buffers 704512 first 0 last 704511
NUMA: buffer 1 node 0 partition 1 buffers 704512 first 704512 last 1409023
NUMA: buffer 2 node 1 partition 0 buffers 704512 first 1409024 last 2113535
NUMA: buffer 3 node 1 partition 1 buffers 704512 first 2113536 last 2818047
NUMA: buffer 4 node 2 partition 0 buffers 688128 first 2818048 last 3506175
NUMA: buffer 5 node 2 partition 1 buffers 688128 first 3506176 last 4194303
NUMA: buffer_partitions_init: 0 => 0 buffers 704512 start 0x7ff7c8c00000 end 0x7ff920c00000 (size 5771362304)
NUMA: buffer_partitions_init: 0 => 0 descriptors 704512 start 0x7ff7b8a00000 end 0x7ff7bb500000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 1 => 0 buffers 704512 start 0x7ff920c00000 end 0x7ffa78c00000 (size 5771362304)
NUMA: buffer_partitions_init: 1 => 0 descriptors 704512 start 0x7ff7bb500000 end 0x7ff7be000000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 2 => 1 buffers 704512 start 0x7ffa78c00000 end 0x7ffbd0c00000 (size 5771362304)
NUMA: buffer_partitions_init: 2 => 1 descriptors 704512 start 0x7ff7be000000 end 0x7ff7c0b00000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 3 => 1 buffers 704512 start 0x7ffbd0c00000 end 0x7ffd28c00000 (size 5771362304)
NUMA: buffer_partitions_init: 3 => 1 descriptors 704512 start 0x7ff7c0b00000 end 0x7ff7c3600000 (size 45088768)
mbind: Invalid argument
NUMA: buffer_partitions_init: 4 => 2 buffers 688128 start 0x7ffd28c00000 end 0x7ffe78c00000 (size 5637144576)
NUMA: buffer_partitions_init: 4 => 2 descriptors 688128 start 0x7ff7c3600000 end 0x7ff7c6000000 (size 44040192)
NUMA: buffer_partitions_init: 5 => 2 buffers 688128 start 0x7ffe78c00000 end 0x7fffc8c00000 (size 5637144576)
NUMA: buffer_partitions_init: 5 => 2 descriptors 688128 start 0x7ff7c6000000 end 0x7ff7c8a00000 (size 44040192)
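Judging from the logged sizes, each descriptor takes 45,088,768 /
704,512 = 64 bytes, so a descriptor partition spans 45,088,768 bytes =
21.5 two-megabyte pages: the boundary between partitions falls in the
middle of a huge page, and mbind() rejects the unaligned range with
EINVAL. The last node's partitions (688,128 buffers, 44,040,192 bytes =
exactly 21 pages) happen to be aligned, so no error is reported for them.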
Another example: 2 nodes and 15872MB shared_buffers. Again,
NBuffers/min_node_buffers=62, so num_buffers_per_node is
31*min_node_buffers, which gives each partition 15.5*min_node_buffers.
Here is the log output:
NUMA: buffers 2031616 partitions 4 num_nodes 2 per_node 2 buffers_per_node 1015808 (min 32768)
NUMA: buffer 0 node 0 partition 0 buffers 507904 first 0 last 507903
NUMA: buffer 1 node 0 partition 1 buffers 507904 first 507904 last 1015807
NUMA: buffer 2 node 1 partition 0 buffers 507904 first 1015808 last 1523711
NUMA: buffer 3 node 1 partition 1 buffers 507904 first 1523712 last 2031615
NUMA: buffer_partitions_init: 0 => 0 buffers 507904 start 0x7ffbf9c00000 end 0x7ffcf1c00000 (size 4160749568)
NUMA: buffer_partitions_init: 0 => 0 descriptors 507904 start 0x7ffbf1e00000 end 0x7ffbf3d00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 1 => 0 buffers 507904 start 0x7ffcf1c00000 end 0x7ffde9c00000 (size 4160749568)
NUMA: buffer_partitions_init: 1 => 0 descriptors 507904 start 0x7ffbf3d00000 end 0x7ffbf5c00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 2 => 1 buffers 507904 start 0x7ffde9c00000 end 0x7ffee1c00000 (size 4160749568)
NUMA: buffer_partitions_init: 2 => 1 descriptors 507904 start 0x7ffbf5c00000 end 0x7ffbf7b00000 (size 32505856)
mbind: Invalid argument
NUMA: buffer_partitions_init: 3 => 1 buffers 507904 start 0x7ffee1c00000 end 0x7fffd9c00000 (size 4160749568)
NUMA: buffer_partitions_init: 3 => 1 descriptors 507904 start 0x7ffbf7b00000 end 0x7ffbf9a00000 (size 32505856)
mbind: Invalid argument
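The same arithmetic applies: 507,904 descriptors * 64 bytes = 32,505,856
bytes = 15.5 two-megabyte pages per partition, so every descriptor range
ends in the middle of a huge page and each mbind() call fails.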
> So you're saying pgproc_partition_init() should not do just this
> ptr = (char *) ptr + num_procs * sizeof(PGPROC);
> but align the pointer to numa_page_size too? Sounds reasonable.
Yes, that's exactly my point; otherwise we could violate the alignment
rule for 'numa_tonode_memory'. Here is an extraction from the log for a
system with 2 nodes, 2000 max_connections and 2MB pages:
NUMA: pgproc backends 2056 num_nodes 2 per_node 1028
NUMA: pgproc_init_partition procs 0x7fffe7800000 endptr 0x7fffe78d2d20 num_procs 1028 node 0
mbind: Invalid argument
NUMA: pgproc_init_partition procs 0x7fffe7a00000 endptr 0x7fffe7ad2d20 num_procs 1028 node 1
mbind: Invalid argument
NUMA: pgproc_init_partition procs 0x7fffe7c00000 endptr 0x7fffe7c07cb0 num_procs 38 node -1
mbind: Invalid argument
mbind: Invalid argument
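A minimal sketch of that change (TYPEALIGN is the standard alignment
macro from c.h; numa_page_size as referenced in the quote above):

ptr = (char *) ptr + num_procs * sizeof(PGPROC);
/* round up to the next NUMA page so the following partition starts aligned */
ptr = (char *) TYPEALIGN(numa_page_size, ptr);

This wastes the tail of each partition's last page, but every subsequent
'numa_tonode_memory' call then sees a properly aligned start address.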
> I don't think the memset() is a problem. Yes, it might map it to the
> current node, but so what - the numa_tonode_memory() will just move it
> to the correct one.
Well, the 'numa_tonode_memory' call does not move pages to the target
node. It merely sets the policy for the mapping, so the system will try
to serve pages from the correct node once we touch them. However, a page
that is already faulted in is not affected by this policy, which is why
the call is faster than 'numa_move_pages'. As stated in the libnuma
documentation:
* numa_tonode_memory() put memory on a specific node. The constraints
described for numa_interleave_memory() apply here too.
* numa_interleave_memory() interleaves size bytes of memory page by
page from start on nodes specified in nodemask. <...> This is a lower
level function to interleave allocated but not yet faulted in memory.
Not yet faulted in means the memory is allocated using mmap(2) or
shmat(2), but has not been accessed by the current process yet. <...>
If the numa_set_strict() flag is true then the operation will cause a
numa_error if there were already pages in the mapping that do not follow
the policy.
I assume that for regular pages the kernel may rebalance memory later
(though not immediately), but not for huge pages. So we really don't
want to touch the memory area before calling 'numa_tonode_memory'.
This can easily be tested with a simple program:
#include <stdio.h>
#include <string.h>
#include <numa.h>
#include <sys/mman.h>
#include <linux/mman.h>

#define MAP_SIZE (2 * 1024 * 1024)

int main(int argc, char **argv) {
    void *ptr1 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
    void *ptr2 = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
    /* Fault the first mapping before its policy is set */
    memset(ptr1, 1, MAP_SIZE);
    /* Bind both mappings to node 1 */
    numa_tonode_memory(ptr1, MAP_SIZE, 1);
    numa_tonode_memory(ptr2, MAP_SIZE, 1);
    /* Fault the second mapping after its policy is set */
    memset(ptr2, 1, MAP_SIZE);
    /* Wait, so the mappings can be inspected from another terminal */
    printf("ptr1=%lx\nptr2=%lx\nPress Enter to continue...\n",
           (unsigned long) ptr1, (unsigned long) ptr2);
    getchar();
    munmap(ptr2, MAP_SIZE);
    munmap(ptr1, MAP_SIZE);
    return 0;
}
Running it on the first node:
# gcc -o test_mem test_mem.c -lnuma
# taskset -c 0 ./test_mem
ptr1=7ffff7a00000
ptr2=7ffff7800000
Press Enter to continue...
From another terminal:
# grep huge /proc/`pgrep test_mem`/numa_maps
7ffff7800000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1 N1=1 kernelpagesize_kB=2048
7ffff7a00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=1 N0=1 kernelpagesize_kB=2048
So, while the policy (bind:1) is set for both mappings, only the second
one (which was not touched before the 'numa_tonode_memory' invocation)
is actually located on node 1 rather than node 0.
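The placement can also be verified from within the process itself; a
sketch of a fragment (to drop into the test program after the second
memset), using numa_move_pages() with a NULL 'nodes' argument, which
queries page locations instead of moving anything:

/*
 * With 'nodes' == NULL, numa_move_pages() moves nothing and instead
 * fills 'status' with the node each page currently resides on (or a
 * negative errno value for pages that could not be queried).
 */
int status1 = -1, status2 = -1;
numa_move_pages(0 /* self */, 1, &ptr1, NULL, &status1, 0);
numa_move_pages(0 /* self */, 1, &ptr2, NULL, &status2, 0);
printf("ptr1 on node %d, ptr2 on node %d\n", status1, status2);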
> What kind of hardware was that? What/how many cpus, NUMA nodes, how
> much memory, what storage?
Of course, that's a valid question. I probably should not have commented
on the performance side without providing full data, as I was still
trying to measure it and those were just preliminary runs. Sorry for
that.
Thanks,
Alexey