Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Adding basic NUMA awareness
Date: 2025-10-31 11:57:33
Message-ID: e4d7e6fc-b5c5-4288-991c-56219db2edd5@vondra.me
Lists: pgsql-hackers

Hi,

here's a significantly reworked version of this patch series.

I had a couple of discussions about these patches at pgconf.eu last
week, and one interesting suggestion was that maybe it'd be easier to
do the clock-sweep partitioning first, in a NUMA-oblivious way, and
then add the NUMA stuff later.

The logic is that this way we could ignore some of the hard stuff
(e.g. handling huge page reservation), while still reducing clock-sweep
contention, which we speculated might be the main benefit anyway.

The attached patches do this.

0001 - Introduces a simplified version of the "buffer partition
registry" (think array in shmem, storing info about ranges of shared
buffers). The partitions are calculated as a simple fraction of shared
buffers. There's no need to align the partitions to memory pages etc.
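
To give an idea of what I mean by the registry, here's a minimal
sketch (hypothetical names, not the actual patch code) - a small array
in shmem, with each element describing one contiguous range of
buffers:

typedef struct BufferPartition
{
    int         first_buffer;   /* first buffer ID in the range */
    int         num_buffers;    /* number of buffers in the range */
} BufferPartition;

typedef struct BufferPartitionRegistry
{
    int         num_partitions;
    BufferPartition partitions[FLEXIBLE_ARRAY_MEMBER];
} BufferPartitionRegistry;

/* split nbuffers into nparts roughly equal ranges */
static void
partition_registry_init(BufferPartitionRegistry *reg,
                        int nbuffers, int nparts)
{
    int         chunk = nbuffers / nparts;

    reg->num_partitions = nparts;

    for (int i = 0; i < nparts; i++)
    {
        reg->partitions[i].first_buffer = i * chunk;
        /* the last partition simply absorbs the remainder */
        reg->partitions[i].num_buffers =
            (i < nparts - 1) ? chunk : (nbuffers - i * chunk);
    }
}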

0002-0005 - Do the clock-sweep partitioning. I chose to keep this
split into smaller increments, to make the patches easier to review.
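
The core idea is the same as the current clock-sweep, just applied per
partition - each partition gets its own clock hand, so backends
working on different partitions don't contend on a single atomic
counter. Roughly this (again a sketch with hypothetical names):

typedef struct ClockSweepPartition
{
    pg_atomic_uint32 next_victim;   /* clock hand, local to this partition */
    int         first_buffer;       /* first buffer ID in the partition */
    int         num_buffers;        /* number of buffers in the partition */
} ClockSweepPartition;

/* advance the partition's clock hand, return the next candidate buffer */
static inline int
clock_sweep_next(ClockSweepPartition *part)
{
    uint32      hand = pg_atomic_fetch_add_u32(&part->next_victim, 1);

    return part->first_buffer + (int) (hand % part->num_buffers);
}

That's essentially what the clock-sweep already does with the single
nextVictimBuffer counter, just replicated N times (ignoring the
wraparound handling the real code needs).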

0006 - Makes the partitioning NUMA-aware. This used to be part of
0001, but now it's moved on top of the clock-sweep stuff. It ensures
the partitions are properly aligned to memory pages, and so on.
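
The alignment is where most of the extra complexity lives - for a
range of buffers to be mapped to a particular NUMA node (e.g. with
numa_tonode_memory() from libnuma), the partition boundaries have to
fall on memory-page boundaries. A sketch of the calculation
(hypothetical helper, with page_size being the regular or huge page
size):

/*
 * Round the partition start down to a page boundary, in units of
 * buffers, so each partition's memory can be bound to one NUMA node.
 */
static int
partition_first_buffer(int partition, int nparts, int nbuffers,
                       Size page_size)
{
    int         buffers_per_page = (int) (page_size / BLCKSZ);
    int         start = (int) (((int64) nbuffers * partition) / nparts);

    return start - (start % buffers_per_page);
}

With 2MB huge pages and the default 8kB blocks, that means partition
boundaries at multiples of 256 buffers.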

0007 - PGPROC partitioning.

This made the 0001 patch much simpler/smaller - it used to be ~50kB, now
it's 15kB (and most of the complexity is in 0006).

The question, however, is how this performs - how much of the benefit
was due to NUMA-awareness, and how much was due to just partitioning
the clock-sweep. I repeated the benchmark from [1], doing concurrent
sequential scans to put significant pressure on buffer replacements,
and I got this:

  hp  clients | master | sweep  sweep-16 | numa  numa-16
==============|========|=================|===============
  off      16 |     24 |    46        46 |   33       40
           32 |     33 |    53        51 |   45       51
           48 |     38 |    51        61 |   46       56
           64 |     41 |    56        75 |   47       65
           80 |     47 |    53        77 |   48       71
           96 |     45 |    54        80 |   47       66
          112 |     45 |    52        83 |   44       65
          128 |     43 |    55        81 |   39       48
--------------|--------|-----------------|---------------
  on       16 |     26 |    47        47 |   35       42
           32 |     33 |    49        52 |   40       49
           48 |     39 |    52        63 |   43       57
           64 |     42 |    53        72 |   43       66
           80 |     43 |    54        81 |   46       71
           96 |     48 |    58        80 |   49       73
          112 |     51 |    58        78 |   51       76
          128 |     55 |    60        83 |   52       76

"hp" means huge pages, the compared branches are:

- master - current master
- sweep - patches up to 0005, default number of partitions (4)
- sweep-16 - patches up to 0005, 16 partitions
- numa - patches up to 0006, default number of partitions (4)
- numa-16 - patches up to 0006, 16 partitions

Compared to master, the results look like this:

  hp  clients | sweep  sweep-16 | numa  numa-16
==============|=================|===============
  off      16 |  192%      192% | 138%     167%
           32 |  161%      155% | 136%     155%
           48 |  132%      160% | 121%     145%
           64 |  137%      183% | 115%     159%
           80 |  113%      164% | 102%     151%
           96 |  120%      177% | 104%     146%
          112 |  116%      184% |  98%     144%
          128 |  128%      186% |  90%     110%
--------------|-----------------|---------------
  on       16 |  181%      181% | 135%     162%
           32 |  148%      158% | 121%     148%
           48 |  133%      161% | 110%     144%
           64 |  126%      171% | 102%     157%
           80 |  126%      188% | 107%     165%
           96 |  121%      167% | 102%     152%
          112 |  114%      153% | 100%     149%
          128 |  109%      151% |  95%     138%

The attached PDF has more results, for runs with somewhat modified
parameters, but overall it's very similar to these numbers.

I think this confirms that most of the benefit really comes from just
partitioning the clock-sweep, and it's mostly independent of the NUMA
stuff. In fact, the NUMA partitioning is often slower. Some of this may
be due to inefficiencies in the patch (e.g. the division in the formula
calculating the partition index, etc.).
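
For example, mapping a buffer ID to its partition currently requires
an integer division by a value only known at runtime. The obvious
micro-optimization (just a sketch, not what the patch does) would be
to force the partition size to a power of two, turning the division
into a shift:

/* what the current formula boils down to */
static inline int
buffer_partition_div(int buf_id, int buffers_per_partition)
{
    return buf_id / buffers_per_partition;
}

/* with power-of-two partition sizes, a shift is enough */
static inline int
buffer_partition_shift(int buf_id, int partition_shift)
{
    return buf_id >> partition_shift;
}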

So I think this looks quite promising ...

There are a couple of unsolved issues, though. While running the
tests, I ran into a bunch of weird failures, of two types:

1) Bad address
-----------------------------------------------------------------------
2025-10-30 15:24:21.195 UTC [2038558] LOG: could not read blocks
114543..114558 in file "base/16384/16588": Bad address
2025-10-30 15:24:21.195 UTC [2038558] STATEMENT: SELECT * FROM t_41
OFFSET 1000000000

2025-10-30 15:24:21.195 UTC [2038523] LOG: could not read blocks
119981..119996 in file "base/16384/16869": Bad address
2025-10-30 15:24:21.195 UTC [2038523] CONTEXT: completing I/O on behalf
of process 2038464
2025-10-30 15:24:21.195 UTC [2038523] STATEMENT: SELECT * FROM t_96
OFFSET 1000000000

2025-10-30 15:24:21.195 UTC [2038492] LOG: could not read blocks
118226..118232 in file "base/16384/16478": Bad address
2025-10-30 15:24:21.195 UTC [2038492] STATEMENT: SELECT * FROM t_19
OFFSET 1000000000

2025-10-30 15:24:21.196 UTC [2038477] LOG: could not read blocks
120515..120517 in file "base/16384/16945": Bad address
2025-10-30 15:24:21.196 UTC [2038477] CONTEXT: completing I/O on behalf
of process 2038545
2025-10-30 15:24:21.196 UTC [2038477] STATEMENT: SELECT * FROM t_111
OFFSET 1000000000
-----------------------------------------------------------------------

2) Operation canceled
-----------------------------------------------------------------------
2025-10-31 10:57:21.742 UTC [2685933] LOG: could not read blocks
159..174 in file "base/16384/16398": Operation canceled
2025-10-31 10:57:21.742 UTC [2685933] STATEMENT: SELECT * FROM t_3
OFFSET 1000000000

2025-10-31 10:57:21.742 UTC [2685933] LOG: could not read blocks
143..158 in file "base/16384/16398": Operation canceled
2025-10-31 10:57:21.742 UTC [2685933] STATEMENT: SELECT * FROM t_3
OFFSET 1000000000

2025-10-31 10:57:21.781 UTC [2685933] ERROR: could not read blocks
143..158 in file "base/16384/16398": Operation canceled
2025-10-31 10:57:21.781 UTC [2685933] STATEMENT: SELECT * FROM t_3
OFFSET 1000000000
-----------------------------------------------------------------------

I'm still not sure what's causing these, and it happens rarely and
randomly, so it's hard to catch and reproduce. I'd welcome suggestions
on what to look for / what the issue might be.

I did run the whole test under valgrind to make sure there's nothing
obviously broken, but that found no issues. Of course, it's much slower
under valgrind, so maybe it just didn't hit the issue.

I suspect the "Bad address" might be just a different symptom of the
issues with reserving huge pages I already mentioned [2]. I assume
io_uring might try to use huge pages internally, and then it fails
because postgres also reserves huge pages.

I have no idea what the "Operation canceled" errors might be about.

I'm not entirely sure if this affects all the patches, or just the
ones with NUMA partitioning, or if it only happens with huge pages.
I'll do more runs to test this specifically.

But it does seem to be specific to io_uring - or at least the
"Operation canceled" issue does. I haven't seen it after switching to
"worker".

[1]
https://www.postgresql.org/message-id/51e51832-7f47-412a-a1a6-b972101cc8cb%40vondra.me

[2]
https://www.postgresql.org/message-id/1d57d68d-b178-415a-ba11-be0c3714638e%40vondra.me

regards

--
Tomas Vondra

Attachment Content-Type Size
clocksweep-results.pdf application/pdf 60.4 KB
v20251101-0007-NUMA-partition-PGPROC.patch text/x-patch 49.2 KB
v20251101-0006-NUMA-shared-buffers-partitioning.patch text/x-patch 43.6 KB
v20251101-0005-clock-sweep-weighted-balancing.patch text/x-patch 5.2 KB
v20251101-0004-clock-sweep-scan-all-partitions.patch text/x-patch 6.7 KB
v20251101-0003-clock-sweep-balancing-of-allocations.patch text/x-patch 25.3 KB
v20251101-0002-clock-sweep-basic-partitioning.patch text/x-patch 33.9 KB
v20251101-0001-Infrastructure-for-partitioning-shared-buf.patch text/x-patch 15.0 KB
