From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Adding basic NUMA awareness
Date: 2025-10-12 23:58:38
Message-ID: bf95094a-77c2-46cf-913a-443f7419bc79@postgrespro.ru
Lists: pgsql-hackers

Hi Tomas,

Thank you very much for working on this problem and for the entire series of
patches! I've tried to play with these patches a little, and here are some of
my observations and suggestions.

In the current implementation we try to use all available NUMA nodes on
the machine, but it's often useful to limit the database to a specific
set of nodes, so that the other nodes can be used for other processes.
In my testing I wanted to use one node out of four for the client
program, so I needed to limit the database to the remaining nodes. I use
a systemd service with AllowedMemoryNodes/AllowedCPUs to start the
cluster, so the obvious choice for me was to use the 'numa_get_membind'
function instead of 'numa_num_configured_nodes' to get the list of
usable nodes. However, it is much easier to work with logical nodes in
the [0; n] range inside the PG code, so I've decided to add a mapping
between 'logical nodes' (0-n in PG) and the set of physical nodes
actually returned by 'numa_get_membind'. We may need to map node numbers
in both directions, so two translation tables are allocated and filled
at the first usage of the 'pg_numa' functions. It also seems to be a
good idea to isolate all 'libnuma' calls inside 'pg_numa.c', i.e. keep
all 'numa_...' calls in that file, which also allows us to hide this
mapping in static functions. Here is the patch I've used to test this idea:
https://github.com/Lerm/postgres/commit/9ec625c2bf564f5432375ec1d7ad02e4b2559161.
This idea could probably be extended by adding a view that exposes the
mapping to the user (at least for testing purposes) and by allowing the
mapping to be explicitly overridden with a GUC setting. With such a GUC
we would be able to control PG memory usage on NUMA nodes without the
need for systemd resource control or numactl parameters.
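
To illustrate the mapping idea, here is a self-contained sketch (not the
actual patch code; the name 'numa_mapping_init' and the fixed-size arrays
are just illustrative) of what could live behind the 'pg_numa' API:

#include <numa.h>
#include <stdio.h>

#define MAX_NUMA_NODES 1024

static int n_logical_nodes = -1;
static int logical_to_physical[MAX_NUMA_NODES];
static int physical_to_logical[MAX_NUMA_NODES];

static void
numa_mapping_init(void)
{
    struct bitmask *allowed;
    int     max_node;
    int     phys;

    if (numa_available() < 0)
    {
        n_logical_nodes = 0;
        return;
    }

    /* nodes the process may allocate memory from (honors systemd
     * AllowedMemoryNodes / numactl --membind) */
    allowed = numa_get_membind();
    max_node = numa_max_node();

    n_logical_nodes = 0;
    for (phys = 0; phys <= max_node && phys < MAX_NUMA_NODES; phys++)
    {
        physical_to_logical[phys] = -1;
        if (numa_bitmask_isbitset(allowed, phys))
        {
            logical_to_physical[n_logical_nodes] = phys;
            physical_to_logical[phys] = n_logical_nodes;
            n_logical_nodes++;
        }
    }
    numa_bitmask_free(allowed);
}

int
main(void)
{
    int     i;

    numa_mapping_init();
    printf("usable nodes: %d\n", n_logical_nodes);
    for (i = 0; i < n_logical_nodes; i++)
        printf("logical %d -> physical %d\n", i, logical_to_physical[i]);
    return 0;
}

(compiled with -lnuma; the rest of the code would then only see the
logical numbers and translate to physical ones right before the libnuma
calls)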

Next, I've noticed some problems related to the size alignment for the
'numa_tonode_memory' call in the 'pg_numa_move_to_node' function. The
documentation for 'numa_tonode_memory' says that 'The size argument will
be rounded up to a multiple of the system page size'. However, this does
not work well with huge pages, as the alignment is performed for the
default kernel page size (i.e. 4K in most cases). If the addr + size
value (rounded to the default page size) does not cover the entire huge
page, then such an invocation seems to be processed incorrectly and the
allocation policy is not applied to subsequent page accesses in that
segment. At least this was the behavior I observed on Debian 12 /
6.1.40 kernel (i.e. '/proc/<pid>/numa_maps' shows that the segment
contains pages from wrong nodes).
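
As a workaround, the caller can extend the range passed to
'numa_tonode_memory' to huge page boundaries itself. A minimal sketch of
such a helper (the name 'move_to_node_aligned' is hypothetical, and a
real version would have to ensure the widened range still lies within
the mapped segment):

#include <numa.h>
#include <stdint.h>
#include <stddef.h>

static void
move_to_node_aligned(void *addr, size_t size, size_t huge_page_size, int node)
{
    uintptr_t start = (uintptr_t) addr;
    uintptr_t end = start + size;

    /* round start down and end up to the huge page size, since libnuma
     * itself only rounds to the default kernel page size */
    start -= start % huge_page_size;
    end += huge_page_size - 1;
    end -= end % huge_page_size;

    numa_tonode_memory((void *) start, (size_t) (end - start), node);
}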

There are two locations in the current patches where we could face this
situation. The first is the mapping of buffer partitions. With the
current code we basically ensure that the combined size of all
partitions for a single node is aligned to the (huge) page size (as the
size is bound to the number of descriptors on one page), but an
individual partition is not explicitly aligned to this size. So we could
get into a situation in which a single page is split between adjacent
partitions (e.g. 32GB of buffers split across 3 nodes). The current code
tries to map each partition independently, which results in unaligned
calls to 'numa_tonode_memory', so the resulting mapping will be
incorrect. We could either choose the size of each individual partition
so that it is aligned to the desired page size, or map all the
partitions for a single node with a single 'pg_numa_move_to_node'
invocation. During testing I've used the second approach (see the sketch
below), so here is the change to implement such logic:
https://github.com/Lerm/postgres/commit/ee8b3603afd6d89e67b755dadc8e4c25ffba88be.
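
Roughly, the second approach looks like this (an illustrative sketch
only; 'part_start', 'part_size' and 'part_node' are hypothetical arrays
describing the partition layout, not the actual patch data structures):

#include <numa.h>
#include <stddef.h>

static void
move_partitions_by_node(char **part_start, size_t *part_size,
                        int *part_node, int nparts)
{
    int     i = 0;

    while (i < nparts)
    {
        int     node = part_node[i];
        char   *start = part_start[i];
        size_t  total = 0;

        /* accumulate adjacent partitions assigned to the same node */
        while (i < nparts && part_node[i] == node)
        {
            total += part_size[i];
            i++;
        }

        /* one call per node; the combined per-node size is (huge) page
         * aligned by construction, unlike the individual partitions */
        numa_tonode_memory(start, total, node);
    }
}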

The second location which could expose the same problem is the mapping
of the PGPROC arrays in 'pgproc_partition_init': here we need to align
the pointer to the end of the PGPROC partition. There also seem to be
two additional problems with PGPROC partitioning: we need to account for
the additional padding pages in 'PGProcShmemSize' (using the same logic
as with fplocks), and we should not call 'MemSet(ptr, 0, ...)' before
the partition mapping call (otherwise the memory will be mapped to the
current node). Here is a potential change which tries to address these
problems:
https://github.com/Lerm/postgres/commit/eaf12776f59ff150735d0f187595fc8ce3f0a872.
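
The MemSet issue comes down to ordering under the first-touch policy:
the binding has to be applied before the pages are faulted in. A tiny
sketch of the intended order (the helper name is made up):

#include <numa.h>
#include <string.h>

static void
zero_on_node(void *ptr, size_t size, int node)
{
    /* set the allocation policy for the range first ... */
    numa_tonode_memory(ptr, size, node);
    /* ... and only then touch the pages, so the faults place them on
     * 'node' rather than on the node of the current CPU */
    memset(ptr, 0, size);
}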

There are also some potential problems with the distribution of buffers
between nodes. I have a feeling that the current logic in
'buffer_partitions_prepare' does not work correctly if the number of
buffers is only enough to cover a single partition per node, while the
total number of nodes is below MIN_BUFFER_PARTITIONS (i.e. 2 or 3). In
this case we will set 'numa_can_partition' to 'true', but will allocate
2 partitions per node (so, 4 or 6 partitions in total), while we can
fill just 2 or 3 partitions, leaving the remaining partitions empty.
This should violate the last assert check, as the last partition will
get zero buffers in this case. Another issue is related to the usage of
1GB pages, as the minimal size of a buffer partition is limited by the
minimal number of buffer descriptors in a single page. For 2MB pages
this gives 2097152 / 64 * 8K = 256MB as the minimal partition size, but
for 1GB pages the minimal size is 1GB / 64 * 8K = 128GB. So, if we
assume 4 as the minimal number of partitions, then for 2MB pages we need
just 1GB of shared_buffers to enable partitioning (which seems a
perfectly fine minimal limit for most cases), but for 1GB pages we need
to allocate at least 512GB to allow buffer partitioning. Certainly, 1GB
pages are usually used on large machines with a large number of buffers
allocated, but it may still be useful to allow configurations with a
32GB or 64GB buffer cache to use both 1GB pages and buffer partitioning
at the same time. However, I don't see an easy way to achieve this with
the current logic. We would either need to allow the usage of different
page sizes here (i.e. 2MB for descriptors and 1GB for buffers) or
combine buffers and their descriptors into a single object (i.e. a
'buffer chunk', which covers enough buffers and their descriptors to fit
into one or several memory pages), effectively replacing both the
buffers and descriptors arrays with an array of such 'chunks'. The
latter solution may also help with dynamic buffer cache resizing (as we
could just add additional 'chunks' in this case) and also increase TLB
hits with 1GB pages (as a descriptor and its buffer will likely be
located in the same page). However, both of these changes seem to be
quite large.
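
Just to make the 'buffer chunk' idea concrete, a very rough sketch (all
sizes here are illustrative: 64-byte descriptors, 8KB blocks, 2MB huge
pages; this is not a proposal for the actual layout):

#define CHUNK_PAGE_SIZE   (2 * 1024 * 1024)
#define DESC_SIZE         64
#define EXAMPLE_BLCKSZ    8192
/* n * (DESC_SIZE + EXAMPLE_BLCKSZ) <= CHUNK_PAGE_SIZE  =>  n = 254 */
#define BUFFERS_PER_CHUNK (CHUNK_PAGE_SIZE / (DESC_SIZE + EXAMPLE_BLCKSZ))

typedef struct BufferChunk
{
    char descriptors[BUFFERS_PER_CHUNK][DESC_SIZE];   /* descriptor slots */
    char blocks[BUFFERS_PER_CHUNK][EXAMPLE_BLCKSZ];   /* data pages */
} BufferChunk;

Each such chunk fits into a single 2MB huge page (with a bit of slack),
and growing or shrinking the buffer cache would then amount to adding or
removing chunks.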

I've also tried to run some benchmarks on my server: I got some
improvement in 'pgbench/tpcb-like' results - about 8%, but only with
backends pinned to a NUMA node (i.e. adjusting your previous pinning
patch to the 'debug_numa' GUC:
https://github.com/Lerm/postgres/commit/5942a3e12c7c501aa9febb63972a039e7ce00c20).
For the 'select-only' scenario the gain is more substantial (about 15%),
but these tests are tricky, as they are more sensitive to other server
settings and to the specific function layout in the compiled code, so
they need more checks.

Thank you again for sharing these patches!

Thanks,
Alexey
