From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Adding basic NUMA awareness
Date: 2025-07-01 19:07:00
Message-ID: 099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
Lists: pgsql-hackers
Hi,
This is a WIP version of a patch series I'm working on, adding some
basic NUMA awareness for a couple of parts of our shared memory (shared
buffers, etc.). It's based on the experimental patches Andres spoke
about at pgconf.eu 2024 [1], and while it's improved and polished in
various ways, it's still experimental.
But there's a recent thread aiming to do something similar [2], so it
seems better to share this now so that we can discuss both approaches.
This patch set is a bit more ambitious, handling NUMA in a way that
allows smarter optimizations later, so I'm posting it in a separate
thread.
The series is split into patches addressing different parts of the
shared memory, starting (unsurprisingly) with shared buffers, then
buffer freelists and ProcArray. There are a couple of additional parts,
but those are smaller and address miscellaneous stuff.
Each patch has a numa_ GUC, intended to enable/disable that part. This
is meant to make development easier, not as a final interface. I'm not
sure how exactly that should look. It's possible some combinations of
GUCs won't work, etc.
Each patch should have a commit message explaining the intent and
implementation, and then also detailed comments explaining various
challenges and open questions.
But let me go over the basics, and discuss some of the design choices
and open questions that need solving.
1) v1-0001-NUMA-interleaving-buffers.patch
This is the main thing when people think about NUMA - making sure the
shared buffers are allocated evenly on all the nodes, not just on a
single node (which can happen easily with warmup). The regular memory
interleaving would address this, but it also has some disadvantages.
Firstly, it's oblivious to the contents of the shared memory segment,
and we may not want to interleave everything. It's also oblivious to
the alignment of the items (a buffer can easily end up "split" across
multiple NUMA nodes), and to the relationship between different parts
(e.g. there's a BufferBlock and a related BufferDescriptor, and those
might again end up on different nodes).
So the patch handles this by explicitly mapping chunks of shared buffers
to different nodes - a bit like interleaving, but in larger chunks.
Ideally each node gets (1/N) of shared buffers, as a contiguous chunk.
It's a bit more complicated, because the patch distributes both the
blocks and descriptors in the same way, so a buffer and its descriptor
always end up on the same NUMA node. This is one of the reasons why we
need to map larger chunks - NUMA works on page granularity, and the
descriptors are tiny, so many fit on a single memory page.
There's a secondary benefit of explicitly assigning buffers to nodes
using this simple scheme - it allows quickly determining the node ID
for a given buffer ID. This is helpful later, when building the
freelists.
The patch is fairly simple. Most of the complexity is about picking
the chunk size and aligning the arrays (so that they align nicely with
memory pages).
The patch has a GUC "numa_buffers_interleave", with "off" by default.
2) v1-0002-NUMA-localalloc.patch
This simply sets "localalloc" when initializing a backend, so that all
memory allocated later is local, not interleaved. Initially this was
necessary because the patch set the allocation policy to interleaving
before initializing shared memory, and we didn't want to interleave the
private memory. But that's no longer the case - the explicit mapping
to nodes does not have this issue. I'm keeping the patch for
convenience, as it allows experimenting with numactl etc.
The patch has a GUC "numa_localalloc", with "off" by default.
3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
Minor optimization. Andres noticed we're tracking the tail of the
buffer freelist without using it, so the patch removes that.
4) v1-0004-NUMA-partition-buffer-freelist.patch
Right now we have a single freelist, and in busy instances that can be
quite contended. What's worse, the freelist may thrash between
different CPUs, NUMA nodes, etc. So the idea is to have multiple
freelists, each covering a subset of buffers. The patch implements
multiple strategies for how the list can be split (configured using the
"numa_partition_freelist" GUC), for experimenting (a sketch of how a
backend might pick its partition follows the list):
* node - One list per NUMA node. This is the most natural option,
because we now know which buffer is on which node, so we can ensure a
list for a node only has buffers from that node.
* cpu - One list per CPU. Pretty simple, each CPU gets its own list.
* pid - Similar to "cpu", but the processes are mapped to lists based on
PID, not CPU ID.
* none - nothing, single freelist
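Here's a rough sketch of how a backend might pick its freelist
partition under each strategy - the enum and the helper are made up for
illustration, not the actual patch code:

    #define _GNU_SOURCE
    #include <sched.h>              /* sched_getcpu() */
    #include <unistd.h>             /* getpid() */
    #include <numa.h>               /* numa_node_of_cpu() */

    typedef enum
    {
        FREELIST_PARTITION_NONE,
        FREELIST_PARTITION_NODE,
        FREELIST_PARTITION_CPU,
        FREELIST_PARTITION_PID
    } FreelistPartitionStrategy;

    /* which of the num_freelists partitions should this backend use? */
    static int
    choose_freelist(FreelistPartitionStrategy strategy, int num_freelists)
    {
        switch (strategy)
        {
            case FREELIST_PARTITION_NODE:
                /* one list per NUMA node - use the node we're running on */
                return numa_node_of_cpu(sched_getcpu()) % num_freelists;

            case FREELIST_PARTITION_CPU:
                /* one list per CPU */
                return sched_getcpu() % num_freelists;

            case FREELIST_PARTITION_PID:
                /* map processes to lists by PID */
                return (int) getpid() % num_freelists;

            default:
                /* "none" - single freelist */
                return 0;
        }
    }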
Ultimately, I think we'll want to go with "node", simply because it
aligns with the buffer interleaving. But there are improvements needed.
The main challenge is that with multiple smaller lists, a process
can't really use the whole shared buffers. So a single backend will
only use part of the memory. The more lists there are, the worse this
effect is. This is also why I think we won't use the other partitioning
options, because there are going to be more CPUs than NUMA nodes.
Obviously, this needs solving even with per-node lists - we need to
allow a single backend to utilize the whole shared buffers if needed.
There should be a way to "steal" buffers from other freelists (if the
"regular" freelist is empty), but the patch does not implement this.
Shouldn't be hard, I think.
The other missing part is the clocksweep - there's still just a single
instance of the clocksweep, feeding buffers to all the freelists. But
that's clearly a problem, because the clocksweep returns buffers from
all NUMA nodes. The clocksweep really needs to be partitioned the same
way as the freelists, with each partition operating on a subset of
buffers (from the right NUMA node).
I do have a separate experimental patch doing something like that, I
need to make it part of this branch.
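To give an idea of the shape of it, a partitioned clocksweep might look
roughly like this (a sketch using the existing atomics API, not the
actual experimental patch):

    #include "postgres.h"
    #include "port/atomics.h"

    /* one clocksweep per freelist partition / NUMA node */
    typedef struct ClockSweepPartition
    {
        int         first_buffer;   /* first buffer ID in this partition */
        int         num_buffers;    /* buffers in this partition */
        pg_atomic_uint32 next_victim;   /* clock hand, relative to first_buffer */
    } ClockSweepPartition;

    /* advance the clock hand of one partition, return the buffer ID */
    static int
    clocksweep_next_victim(ClockSweepPartition *part)
    {
        uint32      victim = pg_atomic_fetch_add_u32(&part->next_victim, 1);

        return part->first_buffer + (int) (victim % part->num_buffers);
    }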
5) v1-0005-NUMA-interleave-PGPROC-entries.patch
Another area that seems like it might benefit from NUMA is PGPROC, so
I gave it a try. It turned out to be somewhat challenging. Similarly to
buffers, we have two pieces that need to be located in a coordinated
way - PGPROC entries and fast-path arrays. But we can't use the same
approach as for buffers/descriptors, because
(a) Neither of those pieces aligns with memory page size (PGPROC is
~900B, fast-path arrays are variable length).
(b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
rather high max_connections before we use multiple huge pages.
The fast-path arrays are less of a problem, because those tend to be
larger, and they're accessed through pointers, so we can just adjust
the pointers.
So what I did instead is split the whole PGPROC array into one array
per NUMA node, plus one array for auxiliary processes and 2PC xacts. So
with 4 NUMA nodes there are 5 separate arrays, for example. Each array
is a multiple of memory pages, so we may waste some of the memory. But
that's simply how NUMA works - page granularity.
This however makes one particular thing harder - in a couple of places
we accessed PGPROC entries through PROC_HDR->allProcs, which was pretty
much just one large array. And GetNumberFromPGProc() relied on array
arithmetic to determine the procnumber. With the array partitioned,
this can't work the same way.
But there's a simple solution - if we turn allProcs into an array of
*pointers* to PGPROC arrays, there's no issue. All the places need a
pointer anyway. And then we need an explicit procnumber field in PGPROC,
instead of calculating it.
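In code, the change is roughly this (heavily simplified, only showing
the relevant bits - the actual patch differs in details):

    /* before: one large array, procnumber derived by pointer arithmetic
     *
     *     PGPROC     *allProcs;
     *     GetNumberFromPGProc(proc) == (proc - &ProcGlobal->allProcs[0])
     *
     * after: allProcs holds pointers, so the entries themselves can live
     * in separate per-node arrays, and the procnumber is stored explicitly
     */
    typedef struct PROC_HDR
    {
        PGPROC    **allProcs;       /* allProcs[i] points to the PGPROC with
                                     * procnumber i, in one of the per-node
                                     * arrays */
        /* ... other fields unchanged ... */
    } PROC_HDR;

    #define GetNumberFromPGProc(proc)   ((proc)->procnumber)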
There's a chance this has a negative impact on code that accesses
PGPROC very often, but so far I haven't seen such cases. But if you can
come up with such examples, I'd like to see them.
There's another detail - when obtaining a PGPROC entry in InitProcess(),
we try to get an entry from the same NUMA node. And only if that doesn't
work, we grab the first one from the list (there's still just one PGPROC
freelist, I haven't split that - maybe we should?).
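Roughly, the lookup does something like this (a sketch with made-up
helper names, not the actual code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>

    static PGPROC *
    get_pgproc_for_backend(void)
    {
        int         my_node = numa_node_of_cpu(sched_getcpu());
        PGPROC     *proc;

        /* hypothetical: scan the single freelist for an entry on my_node */
        proc = freelist_find_on_node(my_node);

        /* no luck - just grab the first free entry, whatever its node */
        if (proc == NULL)
            proc = freelist_pop_head();

        return proc;
    }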
This has a GUC "numa_procs_interleave", again "off" by default. It's
not quite correct, though, because the partitioning always happens. The
GUC only affects the PGPROC lookup. (In a way, this may be a bit
broken.)
6) v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch
This is an experimental patch that simply pins the new process to the
NUMA node obtained from the freelist.
Driven by GUC "numa_procs_pin" (default: off).
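The pinning itself is essentially just a libnuma call or two - a sketch
(with the node being the one the PGPROC entry came from):

    #include <numa.h>

    /* restrict the new backend to one NUMA node */
    static void
    pin_backend_to_node(int node)
    {
        if (numa_available() == -1)
            return;

        /* run only on CPUs belonging to this node */
        numa_run_on_node(node);

        /* and prefer allocating memory on it, falling back if needed */
        numa_set_preferred(node);
    }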
Summary
-------
So this is what I have at the moment. I've tried to organize the patches
in the order of importance, but that's just my guess. It's entirely
possible there's something I missed, some other order might make more
sense, etc.
There's also the question of how this relates to other patches
affecting shared memory - I think the most relevant one is the "shared
buffers online resize" by Ashutosh, simply because it touches the
shared memory too. I think the splitting might actually make some
things simpler, or at least more flexible - in particular, it'd allow
us to enable huge pages only for some regions (like shared buffers),
and keep the small pages e.g. for PGPROC. So that'd be good.
But there'd also need to be some logic to "rework" how shared buffers
get mapped to NUMA nodes after resizing. It'd be silly to start with
memory on 4 nodes (25% each), resize shared buffers to 50% and end up
with memory only on 2 of the nodes (because the other 2 nodes were
originally assigned the upper half of shared buffers).
I don't have a clear idea how this would be done, but I guess it'd
require a bit of code invoked sometime after the resize. It'd already
need to rebuild the freelists in some way, I guess.
The other thing I haven't thought about very much is determining on
which CPUs/nodes the instance is allowed to run. I assume we'd start by
simply inheriting/determining that at startup through libnuma, not
through some custom PG configuration (which the patch [2] proposed to
do).
regards
[1] https://www.youtube.com/watch?v=V75KpACdl6E
--
Tomas Vondra
Attachments:
  v1-0001-NUMA-interleaving-buffers.patch (text/x-patch, 26.9 KB)
  v1-0002-NUMA-localalloc.patch (text/x-patch, 3.7 KB)
  v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch (text/x-patch, 1.6 KB)
  v1-0004-NUMA-partition-buffer-freelist.patch (text/x-patch, 19.0 KB)
  v1-0005-NUMA-interleave-PGPROC-entries.patch (text/x-patch, 34.9 KB)
  v1-0006-NUMA-pin-backends-to-NUMA-nodes.patch (text/x-patch, 3.4 KB)