From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Adding basic NUMA awareness
Date: 2025-10-15 17:02:38
Message-ID: 72506e5a-92ac-47fb-9d3a-3b7b420af985@vondra.me
Lists: pgsql-hackers
Hi,
Here's an updated patch series, addressing some of the issues. I've
kept the fixes in separate patches, to make them easier to review and
discuss.
On 10/13/25 20:34, Alexey Makhmutov wrote:
> On 10/13/25 14:09, Tomas Vondra wrote:
>
>> I'm not sure I understand. Are you suggesting there's a bug in the
>> patch, the kernel, or somewhere else?
>
> We need to ensure that both addr and (addr + size) are aligned to the
> page size of the target mapping during 'numa_tonode_memory' invocation,
> otherwise it may produce unexpected results.
>
Hmm. The libnuma docs about numa_interleave_memory say:
... The size argument will be rounded up to a multiple of the system
page size. ...
I interpreted that to mean it does all the necessary rounding. But if this
ignores huge pages (i.e. "system page size" is 4K, not a HP size), then
aligning the size explicitly would be needed.
This would be pretty annoying, though. It'd mean we can't rely on any
rounding done by libnuma, at least for code that might use huge pages.
Nevertheless, the updated patches should address both cases ...
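For illustration, here's a minimal sketch of that explicit alignment (a
hypothetical helper, not the actual patch code; page_size is assumed to
be the huge page size when huge pages are in use):

    #include "postgres.h"
    #include <numa.h>

    /*
     * Align both the start address and the size to page_size before
     * handing the range to libnuma, so we never rely on libnuma's
     * (4K-based) rounding when huge pages are involved.
     */
    static void
    numa_map_range(char *startptr, Size sz, int node, Size page_size)
    {
        char   *start = (char *) TYPEALIGN_DOWN(page_size, startptr);
        char   *end = (char *) TYPEALIGN(page_size, startptr + sz);

        numa_tonode_memory(start, end - start, node);
    }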
>> But this is exactly why (with hugepages) the code aligns everything to
>> huge page boundary, and sizes everything as a multiple of huge page. At
>> least I think so. Maybe I remember wrong?
>
> I assume that there are places in the current patch, which could perform
> such unaligned mapping. See below for samples.
>
>> Can you actually demonstrate this?
>
> This issue is related to the calculation of partition size for buffer
> descriptors in case we have multiple partitions per node. Currently we
> ensure that each node gets a number of buffers which fits into whole
> memory pages, but if we have several partitions per node, then there is
> no guarantee that the partition size will be properly aligned for
> descriptors. We can observe this problem only if we have multiple
> partitions per node, and with MIN_BUFFER_PARTITIONS equal to 4 this
> issue can potentially affect only configurations with 2 or 3 nodes.
>
> Two examples here: first, let's assume we want to have shared_buffers
> set to 32GB with 3 NUMA nodes and 2MB pages. NBuffers will be
> 4,194,304, min_node_buffers will be 32,768 and num_partitions_per_node
> will be 2 (so 6 partitions in total). NBuffers/min_node_buffers = 128,
> so the nearest multiplier for min_node_buffers which allows us to cover
> all buffers with 3 nodes is 43 (42*3 = 126, 43*3 = 129).
> num_buffers_per_node is 43*min_node_buffers and it is aligned to the
> page size, but we need to split it between two partitions, so each gets
> 41.5*min_node_buffers buffers. This still allows us to split the buffers
> themselves on a page boundary, but the descriptor partitions will be
> split right in the middle of a page. Here is the log for such a
> configuration:
> NUMA: buffers 4194304 partitions 6 num_nodes 3 per_node 2
> buffers_per_node 1409024 (min 32768)
> ...
I see. I was really puzzled how a node could get a chunk of buffers that's
not a multiple of the page size, because min_node_buffers was meant to
guarantee that. But now I realize it's not about per-node buffers, it's
about individual partitions.
Initially I thought the right way to fix this was to use min_node_buffers
for each partition, not for each node. But that would increase the amount
of memory needed for NUMA partitioning to work. In practice that wouldn't
be an issue, because it'd still be only ~1GB (with 2MB huge pages), and
the relevant systems will have way more.
But then I realized we don't need to map the partitions one by one.
We can simply map all partitions for the whole NUMA node at once, and
then we don't have this problem at all.
The attached 0007 patch does this to fix the issue. And I just noticed
this is pretty much exactly how you fixed it in your commit ee8b360.
The last partition may still not have its size aligned, though, because
it may not be a multiple of min_node_buffers.
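To make the idea concrete, here's a rough sketch of the per-node mapping
(not the actual 0007 code; names like buffers_base, buffers_per_node and
num_nodes are illustrative):

    /*
     * Map all buffer partitions that live on a node with a single call,
     * instead of mapping each partition separately. The per-node range
     * is a multiple of min_node_buffers, so it stays huge-page aligned
     * even when the individual partitions are not.
     */
    for (int node = 0; node < num_nodes; node++)
    {
        char   *start = buffers_base + (Size) node * buffers_per_node * BLCKSZ;
        Size    sz = (Size) buffers_per_node * BLCKSZ;

        /* the last node gets whatever remains */
        if (node == num_nodes - 1)
            sz = ((Size) NBuffers - (Size) node * buffers_per_node) * BLCKSZ;

        numa_tonode_memory(start, sz, node);
    }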
>
> Another example: 2 nodes and 15872MB shared_buffers. Again, NBuffers/
> min_node_buffers=62, so num_buffers_per_node is 31*min_node_buffers,
> which gives each partition 15.5*min_node_buffers. Here is the log output:
> NUMA: buffers 2031616 partitions 4 num_nodes 2 per_node 2
> buffers_per_node 1015808 (min 32768)
> ...
> mbind: Invalid argument
> NUMA: buffer_partitions_init: 3 => 1 buffers 507904 start 0x7ffee1c00000
> end 0x7fffd9c00000 (size 4160749568)
> NUMA: buffer_partitions_init: 3 => 1 descriptors 507904 start
> 0x7ffbf7b00000 end 0x7ffbf9a00000 (size 32505856)
> mbind: Invalid argument
>
>> So you're saying pgproc_partition_init() should not do just this
>> ptr = (char *) ptr + num_procs * sizeof(PGPROC);
>> but align the pointer to numa_page_size too? Sounds reasonable.
>
> Yes, that's exactly my point, otherwise we could violate the alignment
> rule for 'numa_tonode_memory'. Here is an extraction from the log for
> system with 2 nodes, 2000 max_connections and 2MB pages:
Should be fixed by 0010, which explicitly aligns the size. It's a bit more
extensive than your eaf1277.
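For reference, a simplified sketch of that alignment (not the actual 0010
code; num_procs, numa_page_size and node are placeholders taken from the
discussion above):

    /*
     * Round each partition's size up to the NUMA page size, so the next
     * partition starts on a page boundary and numa_tonode_memory() is
     * always given an aligned start address and size.
     */
    Size    part_size = mul_size(num_procs, sizeof(PGPROC));

    part_size = TYPEALIGN(numa_page_size, part_size);

    numa_tonode_memory(ptr, part_size, node);
    ptr = (char *) ptr + part_size;     /* advance to the next partition */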
BTW, what are the mbind failures about? Is that something we check, at
least in memory?
>
>> I don't think the memset() is a problem. Yes, it might map it to the
>> current node, but so what - the numa_tonode_memory() will just move it
>> to the correct one.
>
> Well, the 'numa_tonode_memory' call does not move pages to the target
> node. It just sets the policy for the mapping, so the system will try
> to provide a page from the correct node once we touch it. However, if the
> page is already faulted, then it won't be affected by this mapping,
> which is why it works faster compared to 'numa_move_pages'. As stated in
> the libnuma documentation:
> * numa_tonode_memory() put memory on a specific node. The constraints
> described for numa_interleave_memory() apply here too.
> * numa_interleave_memory() interleaves size bytes of memory page by
> page from start on nodes specified in nodemask. <...> This is a lower
> level function to interleave allocated but not yet faulted in memory.
> Not yet faulted in means the memory is allocated using mmap(2) or
> shmat(2), but has not been accessed by the current process yet. <...>
> If the numa_set_strict() flag is true then the operation will cause a
> numa_error if there were already pages in the mapping that do not follow
> the policy.
>
Point taken. Patch 0009 fixes this by moving the MemSet() to after the
partitioning; at that point the policy is already set.
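In other words, the order ends up roughly like this (a sketch only; the
names and signatures are illustrative, not the actual patch code):

    bool    found;
    char   *ptr = ShmemInitStruct("PGPROC partitions", size, &found);

    /* set the per-node policy first, while the pages are still unfaulted */
    pgproc_partition_init(ptr, size);

    /* only then fault the pages in; each one lands on its assigned node */
    MemSet(ptr, 0, size);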
There are a couple more fixes. 0008 improves handling of cases that don't
allow NUMA partitioning (like when shared_buffers are too small). 0011
adds the missing padding to PGProcShmemSize, which you also fixed in one
of your commits.
0012 reduces logging in clock-sweep balancing, which on idle systems was
annoyingly verbose.
I kept 0006 separate for now. It got broken by 5e89985928, and the
conflicts were fairly extensive. Better to keep it separate a bit longer.
regards
--
Tomas Vondra
Attachments:
  v20251015-0001-NUMA-shared-buffers-partitioning.patch (text/x-patch, 41.7 KB)
  v20251015-0002-NUMA-clockweep-partitioning.patch (text/x-patch, 35.6 KB)
  v20251015-0003-NUMA-clocksweep-allocation-balancing.patch (text/x-patch, 25.3 KB)
  v20251015-0004-NUMA-weighted-clocksweep-balancing.patch (text/x-patch, 5.1 KB)
  v20251015-0005-NUMA-partition-PGPROC.patch (text/x-patch, 48.7 KB)
  v20251015-0006-fix-StrategyGetBuffer.patch (text/x-patch, 6.6 KB)
  v20251015-0007-fix-map-all-buffer-partitions-for-NUMA-nod.patch (text/x-patch, 4.3 KB)
  v20251015-0008-fix-handle-disabled-NUMA-partitioning-of-b.patch (text/x-patch, 2.9 KB)
  v20251015-0009-fix-move-memset-after-PGPROC-partitioning.patch (text/x-patch, 1.4 KB)
  v20251015-0010-fix-pgproc-partitioning-align-size.patch (text/x-patch, 3.2 KB)
  v20251015-0011-fix-add-padding-for-pgproc-partitions.patch (text/x-patch, 847 bytes)
  v20251015-0012-fix-clock-sweep-log-level.patch (text/x-patch, 1.1 KB)