
From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-08-04 14:24:40
Message-ID: 51e51832-7f47-412a-a1a6-b972101cc8cb@vondra.me
Lists: pgsql-hackers

Hi,

Here's an updated version of the patch series. The main improvement is
the new 0006 patch, adding "adaptive balancing" of allocations. I'll
also share some results from a workload doing a lot of allocations.

adaptive balancing of allocations
---------------------------------

Imagine each backend only allocates buffers from the partition on the
same NUMA node. E.g. you have 4 NUMA nodes (i.e. 4 partitions), and a
backend only allocates buffers from its "home" partition (the one on the
same NUMA node). This is what the earlier patch versions did, and with
many backends that's mostly fine (assuming the backends get spread over
all the NUMA nodes).

But if there are only a few backends doing the allocations, this can
result in very inefficient use of shared buffers - a single backend would
be limited to 25% of the buffers, even if the rest is unused.

There needs to be some way to "redirect" excess allocations to other
partitions, so that the partitions are utilized about the same. This is
what the 0006 patch aims to do (I kept it separate, but it should
probably get merged into the "clocksweep partitioning" patch in the end).

The balancing is fairly simple:

(1) It tracks the number of allocations "requested" from each partition.

(2) At regular intervals (in bgwriter), calculate the "fair share" per
partition, and determine what fraction of the "requests" to handle from
the partition itself, and how many to redirect to other partitions.

(3) Calculate per-partition coefficients (weights) driving this redirection.

I emphasize that (1) tracks "requests", not the actual allocations. Some
of the requests may have been redirected to a different partition and
counted as allocations there. We want to balance the allocations, but the
balancing has to be driven by the requests.

To give you a simple example - imagine there are 2 partitions with this
number of allocation requests:

P1: 900,000 requests
P2: 100,000 requests

This means the "fair share" is 500,000 allocations, so P1 needs to
redirect some of its requests to P2. We end up with these weights:

P1: [ 55, 45]
P2: [ 0, 100]

Assuming the workload does not shift in some dramatic way, this should
result in both partitions handling ~500k allocations.
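
Here's one way the weights could be applied at allocation time - treating
the row for the backend's "home" partition as probabilities when picking
the partition to allocate from. I'm not saying this is literally what the
patch does, it's only a sketch (names are made up):

#include "postgres.h"
#include "common/pg_prng.h"

/*
 * Pick the partition to allocate from, given the backend's home
 * partition and the weight matrix computed by the balancing.
 */
static int
pick_partition(int home, int nparts, double **weight)
{
    double      r = pg_prng_double(&pg_global_prng_state);

    for (int j = 0; j < nparts; j++)
    {
        r -= weight[home][j];
        if (r <= 0)
            return j;
    }

    return home;                /* fallback for rounding issues */
}

With the weights above (as fractions), a backend on P1 ends up allocating
~55% of its buffers from P1 and ~45% from P2, while a backend on P2 keeps
everything local.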

It's not hard to extend this algorithm to more partitions. For more
details see StrategySyncBalance(), which recalculates this.
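
To make that a bit more concrete, here's a rough sketch of the
recalculation. The names, and the way the excess gets spread over the
underloaded partitions (proportionally to how far below the fair share
they are), are made up for illustration - the actual logic is in
StrategySyncBalance():

#include "postgres.h"

/*
 * requests[i] is the number of allocation requests counted for
 * partition i since the last recalculation, weight[i][j] ends up as
 * the fraction of partition i's requests to serve from partition j.
 */
static void
recalc_balance_weights(int nparts, uint64 *requests, double **weight)
{
    uint64      total = 0;
    double      fair_share;
    double      total_room = 0;

    for (int i = 0; i < nparts; i++)
        total += requests[i];

    fair_share = (double) total / nparts;

    /* how many extra allocations can the underloaded partitions absorb? */
    for (int j = 0; j < nparts; j++)
    {
        if (requests[j] < fair_share)
            total_room += (fair_share - requests[j]);
    }

    for (int i = 0; i < nparts; i++)
    {
        for (int j = 0; j < nparts; j++)
            weight[i][j] = 0.0;

        if (requests[i] <= fair_share || total_room == 0)
        {
            /* at/below the fair share - handle all requests locally */
            weight[i][i] = 1.0;
            continue;
        }

        /* keep the fair share locally ... */
        weight[i][i] = fair_share / requests[i];

        /* ... and spread the excess over the partitions with room */
        for (int j = 0; j < nparts; j++)
        {
            if (j == i || requests[j] >= fair_share)
                continue;

            weight[i][j] = (1.0 - weight[i][i]) *
                (fair_share - requests[j]) / total_room;
        }
    }
}

For the two-partition example this produces roughly the [55, 45] and
[0, 100] weights shown above.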

There are a couple open questions, like:

* The algorithm combines the old/new weights by averaging, to add a bit
of hysteresis. Right now it's a simple average with weight 0.5, to dampen
sudden changes (see the snippet after this list). I think it works fine
(in the long run), but I'm open to suggestions on how to do this better.

* There are probably additional things we should consider when deciding
where to redirect the allocations. For example, we may have multiple
partitions per NUMA node, in which case it's better to redirect as many
allocations as possible to partitions on the same node. The current patch
ignores this.

* The partitions may have slightly different sizes, but the balancing
ignores that for now. This is not very difficult to address.
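
FWIW for the first open question, the dampening currently amounts to a
50:50 blend of the old and new weights, i.e. something like this
(variable names are mine):

    weight[i][j] = 0.5 * old_weight[i][j] + 0.5 * new_weight[i][j];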

clocksweep benchmark
--------------------

I ran a simple benchmark focused on allocation-heavy workloads, namely
large concurrent sequential scans. The attached scripts generate a
number of 1GB tables, and then run concurrent sequential scans with
shared buffers set to 60%, 75%, 90% and 110% of the total dataset size.

I did this for master, and with the NUMA patches applied (and the GUCs
set to 'on'). I also tried with the number of partitions increased to 16
(so each NUMA node got multiple partitions).

There are results from three machines:

1) ryzen - small non-NUMA system, mostly to see if there are any regressions

2) xeon - older 2-node NUMA system

3) hb176 - big EPYC system with 176 cores / 4 NUMA nodes

The script records detailed TPS stats (e.g. percentiles). I'm attaching
CSV files with the complete results, and PDFs with charts summarizing
them (I'll get to those in a minute).

For the EPYC, the average tps for the three builds looks like this:

  clients | master    numa   numa-16 |   numa   numa-16
----------|---------------------------|------------------
        8 |     20      27        26 |   133%      129%
       16 |     23      39        45 |   170%      193%
       24 |     23      48        58 |   211%      252%
       32 |     21      57        68 |   268%      321%
       40 |     21      56        76 |   265%      363%
       48 |     22      59        82 |   270%      375%
       56 |     22      66        88 |   296%      397%
       64 |     23      62        93 |   277%      411%
       72 |     24      68        95 |   277%      389%
       80 |     24      72        95 |   295%      391%
       88 |     25      71        98 |   283%      392%
       96 |     26      74        97 |   282%      369%
      104 |     26      74        97 |   282%      367%
      112 |     27      77        95 |   287%      355%
      120 |     28      77        92 |   279%      335%
      128 |     27      75        89 |   277%      328%

That's not bad - the clocksweep partitioning increases the throughput
2-3x, and having 16 partitions (instead of 4) helps a bit more, to 3-4x.

This is for shared buffers set to 60% of the dataset size, which depends
on the number of clients / tables. With 64 clients/tables there's 64GB of
data, so shared buffers are set to ~39GB.

The results for 75% and 90% follow the same pattern. For 110% there's
much less impact - the whole dataset fits into shared buffers, so there
are (almost) no buffer allocations, and any improvement has to be thanks
to the other NUMA patches.

The charts in the attached PDFs add a bit more detail, with various
percentiles (of per-second throughput). The bands are roughly quartiles:
5-25%, 25-50%, 50-75%, 75-95%. The thick middle line is the median.

There are only charts for 60%, 90% and 110% shared buffers, to fit
everything on a single page. The 75% results are not very different.

For ryzen there's little difference. Not surprising, as it's not a NUMA
system. So this is a positive result - there's no regression.

For xeon the patches help a little bit. Again, not surprising. It's a
fairly old system (~2016), and the differences between NUMA nodes are
not that significant.

For epyc (hb176), the differences are pretty massive.

regards

--
Tomas Vondra

Attachment Content-Type Size
numa-benchmark-ryzen.pdf application/pdf 162.5 KB
numa-benchmark-epyc.pdf application/pdf 167.9 KB
numa-benchmark-xeon.pdf application/pdf 154.6 KB
xeon.csv.gz application/gzip 12.3 KB
ryzen.csv.gz application/gzip 12.0 KB
hb176.csv.gz application/gzip 3.8 KB
run.sh application/x-shellscript 1.9 KB
generate.sh application/x-shellscript 1.1 KB
v20250804-0001-NUMA-interleaving-buffers.patch.gz application/gzip 12.0 KB
v20250804-0008-NUMA-pin-backends-to-NUMA-nodes.patch.gz application/gzip 1.6 KB
v20250804-0007-NUMA-interleave-PGPROC-entries.patch.gz application/gzip 13.0 KB
v20250804-0006-NUMA-clocksweep-allocation-balancing.patch.gz application/gzip 7.6 KB
v20250804-0005-NUMA-clockweep-partitioning.patch.gz application/gzip 10.4 KB
v20250804-0004-NUMA-partition-buffer-freelist.patch.gz application/gzip 6.8 KB
v20250804-0003-freelist-Don-t-track-tail-of-a-freelist.patch.gz application/gzip 896 bytes
v20250804-0002-NUMA-localalloc.patch.gz application/gzip 1.7 KB
