Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-06-05 12:52:35
Message-ID: 334f7644-6716-4e63-a7c4-13c5cbcdd210@vondra.me
Lists: pgsql-hackers

Hi,

Here's an updated version of the NUMA patch series, based on some recent
discussions about this (some at pgconf.dev, but not only that).

The main change is that I significantly simplified some of the parts. The
patch from 20251126 was ~190K, the new version is maybe 100K, so about
half. Some of that is thanks to dropping the PGPROC partitioning
entirely, but the remaining parts are smaller too. I realize it's not a
great metric, of course.

In this message I'll explain the changes since 20251126. I have yet to do
a thorough performance evaluation to see if it helps; I'll post that in
the next couple of days.

The current patch series has these parts:

v20260605-0001-Add-shmem_populate-and-shmem_interleave-GU.patch
-------------

Somewhat unrelated, I find this useful for benchmarking and as a
baseline (what would happen if we just interleaved the shared segment).

v20260605-0002-Infrastructure-for-partitioning-of-shared-.patch
-------------

Just adds a small registry of partitions (ranges of shared buffers),
stored in shared memory, and pg_buffercache interface to inspect it.
Merely a foundation for the following patches.

v20260605-0003-NUMA-shared-buffers-partitioning.patch
-------------

The interesting part, which places some of the partitions on NUMA nodes.

v20260605-0004-clock-sweep-basic-partitioning.patch
v20260605-0005-clock-sweep-balancing-of-allocations.patch
v20260605-0006-clock-sweep-scan-all-partitions.patch
-------------

Patches that gradually partition clock-sweep. Ultimately, it should
probably be squashed into a single commit (each commit fixes some sort
of issue in the naive partitioning in 0004). But I kept them separate
because it's easier to review / understand what the issue is.

what changed?
-------------

First, I dropped the PGPROC partitioning. We may revisit that in the
future (not sure), but for now it was just a distraction and I see it as
less impactful than shared buffers / clock-sweep.

I also simplified the GUC to use a single on/off parameter (instead of
the debug_io_direct-like approach). We can revisit that, but for now
this seems more convenient.

The most significant change in the remaining parts is simplification of
the shared buffer partitioning. In particular, the partitioning is now
"best-effort" when it maps memory to shared buffers. As a reminder, NUMA
works at memory page granularity - we can't map arbitrary ranges of
memory to a node, it needs to be whole memory pages.

The 20251126 patch went to great lengths to make sure that (a)
BufferBlocks and BufferDescriptors start at a memory page boundary, and
(b) the partitions are also properly aligned (both for blocks and
descriptors). That was a lot of code, it needed to happen even before we
know if huge pages are used, partitions might have been of (very)
different sizes, and so on.

The new patch abandons this "perfect" partitioning, and instead does a
best-effort one. It splits the buffers as evenly as possible, i.e. all
partitions have (NBuffers/npartitions) buffers, and then places as much
of each partition's memory as possible on the selected NUMA node.

With 4K pages, that's always the whole partition. With huge pages (which
is expected of relevant NUMA systems), there may be a couple of buffers
at the beginning/end of a partition that are not placed on the
partition's node. But it's less than one memory page per partition, and
we expect the systems to have 10s or 100s of GBs, so in the bigger
scheme of things it's negligible (fractions of a percent).
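
To make the arithmetic concrete, here is a rough sketch of the
best-effort idea (not code from the patch). It assumes the default 8KB
block size, an even split where partition sizes differ by at most one
buffer, and a hypothetical base_offset giving the position of the first
buffer relative to a page boundary. All function names are made up for
illustration.

```python
BLCKSZ = 8192  # default PostgreSQL block size (assumed here)


def split_evenly(nbuffers, npartitions):
    """Return (first_buffer, count) per partition; counts differ by at most 1."""
    base, extra = divmod(nbuffers, npartitions)
    ranges, first = [], 0
    for i in range(npartitions):
        count = base + (1 if i < extra else 0)
        ranges.append((first, count))
        first += count
    return ranges


def bindable_bytes(start, end, page_size):
    """Bytes of the range [start, end) that are covered by whole memory pages."""
    aligned_start = -(-start // page_size) * page_size  # round up
    aligned_end = (end // page_size) * page_size         # round down
    return max(0, aligned_end - aligned_start)


def unbound_bytes(nbuffers, npartitions, page_size, base_offset=0):
    """Per-partition bytes that cannot be placed on the partition's own node."""
    result = []
    for first, count in split_evenly(nbuffers, npartitions):
        start = base_offset + first * BLCKSZ
        end = start + count * BLCKSZ
        result.append((end - start) - bindable_bytes(start, end, page_size))
    return result


if __name__ == "__main__":
    # ~16GB of shared buffers, deliberately not a round number of buffers
    nbuffers = (16 * 1024**3) // BLCKSZ + 3
    total = nbuffers * BLCKSZ
    for page_size in (4096, 2 * 1024**2):
        left = unbound_bytes(nbuffers, 4, page_size, base_offset=4096)
        print(page_size, left, f"{sum(left) / total:.6%}")
```

With 4K pages and a 4K-aligned start nothing is left over. With 2MB
pages each partition loses less than one page at either end, and a page
straddling two partitions is shared by both of them.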

For buffer descriptors the math is a bit worse - descriptors need much
less memory, but even there it should not be more than ~1%.
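
A back-of-the-envelope check of the ~1% figure, as a sketch only. It
assumes descriptors padded to 64 bytes (one cache line), 2MB huge pages,
8KB blocks, and roughly one memory page per partition that can't be
placed on the right node. The function name and the 128GB / 4 node
example are mine, not from the patch.

```python
BLCKSZ = 8192             # default PostgreSQL block size (assumed)
DESCRIPTOR_SIZE = 64      # assumption: padded buffer descriptor, one cache line
HUGE_PAGE = 2 * 1024**2   # 2MB huge page


def worst_case_unbound_fraction(nbuffers, npartitions, elem_size, page_size):
    """Rough upper bound: one memory page lost per partition, as a fraction
    of the partition's share of the array (blocks or descriptors)."""
    per_partition = nbuffers // npartitions
    if per_partition == 0:
        raise ValueError("fewer buffers than partitions")
    return min(1.0, page_size / (per_partition * elem_size))


if __name__ == "__main__":
    nbuffers = (128 * 1024**3) // BLCKSZ   # 128GB of shared buffers
    for name, size in (("blocks", BLCKSZ), ("descriptors", DESCRIPTOR_SIZE)):
        frac = worst_case_unbound_fraction(nbuffers, 4, size, HUGE_PAGE)
        print(f"{name}: {frac:.4%}")
```

One 2MB page holds 32768 such descriptors, i.e. descriptors for 256MB of
buffers, which is why the relative loss is so much larger than for the
buffer blocks themselves.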

Seems perfectly fine to me. Or rather, the extra complexity does not
seem worth the possible benefit.

This also allowed dropping a part of the "clock-sweep partitioning"
patches, dealing with cases when the partitions are of different sizes.
With this new best-effort scheme the difference is at most 1 buffer, and
we can just ignore that.

questions
---------

At this point, my main question is whether there's a better way to
partition clock-sweep and/or do the balancing of allocations between
partitions. I believe it does work, but I have a feeling there might be
a more elegant way to do this kind of stuff (like an established
balancing algorithm of some sort).

The other thing I need to verify is how this behaves with
vm.nr_hugepages. With some earlier versions it was easy to end up in a
situation where everything seemed to work, but then much later the
kernel realized it does not have enough huge pages on a particular NUMA
node and the process crashed with a segfault (or was it SIGBUS?).
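
For anyone who wants to check this on their own machine, a small sketch
that reads the per-node huge page counters Linux exposes in sysfs. This
is not part of the patch; the helper names and the sysfs_root parameter
are hypothetical (the parameter exists only to make the function easy to
test), and it assumes 2MB huge pages by default.

```python
import os
import re


def hugepages_per_node(sysfs_root="/sys/devices/system/node", size_kb=2048):
    """Return {node_id: (nr_hugepages, free_hugepages)} for one huge page size."""
    result = {}
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return result  # not Linux, or no NUMA information exposed
    for entry in entries:
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        pool = os.path.join(sysfs_root, entry, "hugepages",
                            f"hugepages-{size_kb}kB")
        try:
            with open(os.path.join(pool, "nr_hugepages")) as f:
                total = int(f.read())
            with open(os.path.join(pool, "free_hugepages")) as f:
                free = int(f.read())
        except (OSError, ValueError):
            continue  # this node has no pool of that size
        result[int(match.group(1))] = (total, free)
    return result


def nodes_short_of_pages(per_node, pages_needed_per_node):
    """Node ids whose free huge pages are below what one partition would need."""
    return sorted(node for node, (_, free) in per_node.items()
                  if free < pages_needed_per_node)


if __name__ == "__main__":
    per_node = hugepages_per_node()
    print(per_node)
    # e.g. a 4GB partition needs 2048 huge pages of 2MB on its node
    print(nodes_short_of_pages(per_node, 2048))
```

Running it before starting the server shows whether the free pages are
spread evenly enough across nodes for each partition to fit.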

Of course, the other question is performance validation - does it even
help? I plan to repeat the various experiments mentioned in this thread
(by Andres and others) on available NUMA machines. But if someone has an
idea for another benchmark (and/or what metric to measure, not just the
usual duration), let me know.

regards

--
Tomas Vondra

Attachment                                                       Content-Type  Size
v20260605-0006-clock-sweep-scan-all-partitions.patch             text/x-patch  6.2 KB
v20260605-0005-clock-sweep-balancing-of-allocations.patch        text/x-patch  27.4 KB
v20260605-0004-clock-sweep-basic-partitioning.patch              text/x-patch  34.0 KB
v20260605-0003-NUMA-shared-buffers-partitioning.patch            text/x-patch  26.6 KB
v20260605-0002-Infrastructure-for-partitioning-of-shared-.patch  text/x-patch  14.3 KB
v20260605-0001-Add-shmem_populate-and-shmem_interleave-GU.patch  text/x-patch  4.9 KB
