Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-28 14:19:07
Message-ID: 71a46484-053c-4b81-ba32-ddac050a8b5d@vondra.me
Lists: pgsql-hackers

Hi,

Here's a somewhat cleaned up v3 of this patch series, with various
improvements and a lot of cleanup. It's still WIP, but I hope it
resolves the various crashes reported for v2. It still requires
--with-libnuma (it won't build without it).

I'm aware there's an ongoing discussion about removing the freelists,
and changing the clocksweep in some way. If that happens, the relevant
parts of this series will need some adjustment, of course. I haven't
looked into that yet; I plan to review those patches soon.

main changes in v3
------------------

1) I've introduced a "registry" of the buffer partitions (imagine a
small array of structs), serving as a source of truth for places that
need info about the partitions (range of buffers, ...).

With v2 there was no "shared definition" - the shared buffers, freelist
and clocksweep each did their own thing. But per the discussion it
doesn't really make much sense to partition buffers in different ways.

So in v3 the 0001 patch defines the partitions, records them in shared
memory (in a small array), and the later parts just reuse this.
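
To illustrate the idea, here's a minimal sketch of what a registry
entry might look like - the field names are mine, not necessarily what
the 0001 patch actually uses:

    /*
     * Hypothetical sketch of a buffer-partition registry entry. One entry
     * per partition, kept in a small array in shared memory, so that the
     * freelist/clocksweep parts can look up the buffer range of each
     * partition in one place.
     */
    typedef struct BufferPartitionEntry
    {
        int     numa_node;      /* NUMA node the partition is assigned to */
        int     first_buffer;   /* first buffer id in the partition */
        int     last_buffer;    /* last buffer id (inclusive) */
    } BufferPartitionEntry;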

I also added a pg_buffercache_partitions() listing the partitions, with
first/last buffer, etc. The freelist/clocksweep patches add additional
information.

2) The PGPROC part introduces a similar registry, even though there are
no other patches building on this. But it seemed useful to have a clear
place recording this info.

There's also a view pg_buffercache_pgproc. The pg_buffercache location
is a bit bogus - it has nothing to do with buffers, but it was good
enough for now.

3) The PGPROC partitioning is reworked and should fix the crash with the
GUC set to "off".

4) This still doesn't do anything about "balancing" the clocksweep. I
have some ideas how to do that; I'll work on that next.

simple benchmark
----------------

I did a simple benchmark, measuring pgbench throughput with a scale
that still fits into RAM, but is much larger (~2x) than shared buffers.
See the attached test script, which tests builds with more and more of
the patches applied.

I'm attaching results from two different machines (the "usual" 2P Xeon
and a much larger cloud instance with EPYC/Genoa) - both the raw CSV
files, with average tps and percentiles, and PDFs. The PDFs also include
a comparison either to the "preceding" build (right side) or to master
(below the table).

There are results for the three "pgbench pinning" strategies, and that
can have a pretty significant impact (colocated generally performs much
better than either "none" or "random").

For the "bigger" machine (with 176 cores) the incremental results look
like this (for pinning=none, i.e. regular pgbench):

 mode      s_b    buffers  localalloc  no-tail  freelist  sweep  pgproc  pinning
 ================================================================================
 prepared  16GB     99%       101%       100%     103%     111%    99%     102%
           32GB     98%       102%        99%     103%     107%   101%     112%
            8GB     97%       102%       100%     102%     101%   101%     106%
 --------------------------------------------------------------------------------
 simple    16GB    100%       100%        99%     105%     108%    99%     108%
           32GB     98%       101%       100%     103%     100%   101%      97%
            8GB    100%       100%       101%      99%     100%   104%     104%

The way I read this is that the first three patches have essentially no
impact on throughput. Then freelist partitioning and (especially)
clocksweep partitioning can help quite a bit. pgproc is again close to
~0%, and PGPROC pinning can help again (but that part is merely
experimental).

For the Xeon the differences (in either direction) are much smaller, so
I'm not going to post them here. They're in the PDF, though.

I think this looks reasonable. The way I see this patch series is not
about improving peak throughput, but more about reducing imbalance and
making the behavior more consistent.

The results are more a confirmation that there isn't some sort of
massive overhead hiding somewhere. But I'll get to that in a minute.

To quantify this kind of improvement, I think we'll need tests that
intentionally cause (or try to cause) imbalance. If you have ideas for
such tests, let me know.

overhead of partitioning calculation
------------------------------------

Regarding the "overhead": while the results look mostly OK, I think
we'll need to rethink the partitioning scheme - particularly how the
partition size is calculated. The current scheme has to use % (modulo),
which can be somewhat expensive.

The 0001 patch calculates a "chunk size", which is the smallest number
of buffers it can "assign" to a NUMA node. This depends on how many
buffer descriptors fit onto a single memory page, and it works out to
either 512KB of buffers (with 4KB pages) or 256MB (with 2MB huge pages).
Each NUMA node then gets multiple chunks, enough to cover
shared_buffers/num_nodes. But that can be an arbitrary number - it
minimizes the imbalance, but it also forces the use of % and / in the
formulas.
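
To make the arithmetic concrete, here's a back-of-the-envelope sketch
of the chunk-size calculation, assuming 8KB blocks and buffer
descriptors padded to 64 bytes (the constants and names are only
illustrative, not taken from the patch):

    /* illustrative only - 8KB blocks, 64-byte (padded) buffer descriptors */
    #define BLOCK_SIZE      8192
    #define BUFDESC_SIZE    64

    static size_t
    chunk_size_for_page(size_t memory_page_size)
    {
        /* how many buffer descriptors fit onto one memory page */
        size_t  descs_per_page = memory_page_size / BUFDESC_SIZE;

        /* the amount of buffer data those descriptors describe */
        return descs_per_page * BLOCK_SIZE;
    }

    /*
     * chunk_size_for_page(4096)    = 64 * 8KB    = 512KB
     * chunk_size_for_page(2097152) = 32768 * 8KB = 256MB
     */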

AFAIK if we required the partitions to be 2^k multiples of the chunk
size, we could switch to shifts and masking, which is supposed to be
much faster. But I haven't measured this, and the cost is that some of
the nodes could get much less memory. Maybe that's fine.
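
For comparison, this is roughly what the lookup could look like if each
partition covered a power-of-two number of buffers - again just a
sketch of the idea, not code from the patch:

    /*
     * Hypothetical: with partitions of exactly 2^partition_shift buffers,
     * the partition index and the offset within it fall out of a shift
     * and a mask, instead of a division and a modulo.
     */
    static inline int
    partition_for_buffer(int buffer, int partition_shift)
    {
        return buffer >> partition_shift;
    }

    static inline int
    offset_in_partition(int buffer, int partition_shift)
    {
        return buffer & ((1 << partition_shift) - 1);
    }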

reserving number of huge pages
------------------------------

The other thing I realized is that partitioning buffers with huge pages
is quite tricky, and can easily lead to SIGBUS when accessing the memory
later. The crashes I saw happen like this:

1) figure out the number of huge pages needed (using
shared_memory_size_in_huge_pages)

This can be 16828 for shared_buffers=32GB.

2) make sure there's enough huge pages

echo 16828 > /proc/sys/vm/nr_hugepages

3) start postgres - everything seems to work just fine

4) query pg_buffercache_numa - triggers SIGBUS accessing memory for a
valid buffer (usually ~2GB from the end)

It took me ages to realize what's happening, but it's very simple. The
nr_hugepages is a global limit, but it's also translated into limits for
each NUMA node. So when you write 16828 to it, in a 4-node system each
node gets 1/4 of that. See

$ numastat -cm

Then we do the mmap(), and everything looks great, because there really
are enough huge pages and the system can allocate memory from any NUMA
node it needs.

And then we come around and do the numa_tonode_memory(). That's where
the issues start, because AFAIK this does not check the per-node limit
of huge pages in any way. It just appears to work. Then later, when we
finally touch the buffer, the kernel tries to actually allocate the
memory on the node, realizes there aren't enough huge pages, and
triggers the SIGBUS.
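
To make the failure mode concrete, the sequence is roughly the
following - a minimal sketch using the libnuma call, not the actual
patch code, with error handling omitted:

    #include <sys/mman.h>
    #include <numa.h>

    /*
     * Sketch of why the failure is deferred: neither mmap() nor
     * numa_tonode_memory() allocates huge pages on the target node.
     * The first write into a page does, and that's where SIGBUS fires
     * if the per-node huge page pool is already exhausted.
     */
    static char *
    bind_partition(size_t size, int node)
    {
        char   *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        /* sets the memory policy for the range, allocates nothing yet */
        numa_tonode_memory(ptr, size, node);

        /* first touch: only now are huge pages allocated on "node" */
        ptr[0] = 0;

        return ptr;
    }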

You may ask why the per-node limit is too low. We still need just
shared_memory_size_in_huge_pages, right? If we were partitioning the
whole memory segment, that'd be true. But we only do that for shared
buffers, and there's a lot of other shared memory - could be 1-2GB or
so, depending on the configuration.

That other memory gets placed on one of the nodes, and it counts
against the limit on that particular node. And so that node doesn't
have enough huge pages left to back its partition of shared buffers.

The only way around this I found is to inflate the number of huge pages
significantly above the shared_memory_size_in_huge_pages value, just to
make sure the nodes get enough huge pages.

I don't know what to do about this. It's quite annoying. If we only used
huge pages for the partitioned parts, this wouldn't be a problem.

I also realize this can be used to make sure the memory is balanced on
NUMA systems. Because if you set nr_hugepages, the kernel will ensure
the shared memory is distributed across all the nodes.

It won't have the benefits of "coordinating" the buffers and buffer
descriptors, and so on. But it will be balanced.

regards

--
Tomas Vondra

Attachment Content-Type Size
v3-0007-NUMA-pin-backends-to-NUMA-nodes.patch text/x-patch 3.5 KB
v3-0006-NUMA-interleave-PGPROC-entries.patch text/x-patch 46.8 KB
v3-0005-NUMA-clockweep-partitioning.patch text/x-patch 39.4 KB
v3-0004-NUMA-partition-buffer-freelist.patch text/x-patch 20.9 KB
v3-0003-freelist-Don-t-track-tail-of-a-freelist.patch text/x-patch 1.6 KB
v3-0002-NUMA-localalloc.patch text/x-patch 3.7 KB
v3-0001-NUMA-interleaving-buffers.patch text/x-patch 38.2 KB
numa-hb176.csv text/csv 17.8 KB
run-huge-pages.sh application/x-shellscript 4.6 KB
numa-xeon.csv text/csv 17.0 KB
numa-xeon-e5-2699.pdf application/pdf 53.6 KB
numa-hb176.pdf application/pdf 54.2 KB
