| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2025-10-31 11:57:33 |
| Message-ID: | e4d7e6fc-b5c5-4288-991c-56219db2edd5@vondra.me |
| Lists: | pgsql-hackers |
Hi,
here's a significantly reworked version of this patch series.
I had a couple of discussions about these patches at pgconf.eu last week,
and one interesting suggestion was that maybe it'd be easier to do the
clock-sweep partitioning first, in a NUMA-oblivious way, and then add
the NUMA stuff later.
The logic is that this way we could ignore some of the hard stuff (e.g.
handling huge page reservation), while still reducing clock-sweep
contention, which we speculated might be the main benefit anyway.
The attached patches do this.
0001 - Introduces a simplified version of the "buffer partition
registry" (think array in shmem, storing info about ranges of shared
buffers). The partitions are calculated as a simple fraction of shared
buffers. There's no need to align the partitions to memory pages etc.
(There's a rough sketch of the idea below.)
0002-0005 - Do the clock-sweep partitioning. I chose to keep this
split into smaller increments, to make the patches easier to review.
0006 - Makes the partitioning NUMA-aware. This used to be part of 0001,
but now it's moved on top of the clock-sweep stuff. It ensures the
partitions are properly aligned to memory pages, and all that.
0007 - PGPROC partitioning.
This made the 0001 patch much simpler/smaller - it used to be ~50kB, now
it's 15kB (and most of the complexity is in 0006).
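To make the 0001 part a bit more concrete, here's a rough sketch of the
kind of thing I mean by "buffer partition registry". The names and layout
are invented for this email, it's not code copied from the patch:

-----------------------------------------------------------------------
/* Illustration only - names invented, not taken from the patch. */
typedef struct BufferPartitionEntry
{
    int     first_buffer;   /* first buffer id in this partition */
    int     num_buffers;    /* number of buffers in the partition */
} BufferPartitionEntry;

/*
 * Split nbuffers into npartitions roughly equal ranges, with the last
 * partition absorbing the remainder. No alignment to memory pages -
 * that only becomes necessary with the NUMA patch (0006).
 */
static void
buffer_partitions_init(BufferPartitionEntry *parts,
                       int nbuffers, int npartitions)
{
    int     per_part = nbuffers / npartitions;

    for (int i = 0; i < npartitions; i++)
    {
        parts[i].first_buffer = i * per_part;
        parts[i].num_buffers = (i == npartitions - 1)
            ? (nbuffers - i * per_part)
            : per_part;
    }
}
-----------------------------------------------------------------------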
The question, however, is how this performs, i.e. how much of the
benefit was due to NUMA-awareness and how much was due to just
partitioning the clock-sweep. I repeated the benchmark from [1], doing
concurrent sequential scans to put significant pressure on buffer
replacements, and I got this:
hp   clients | master | sweep  sweep-16 | numa  numa-16
=============|========|=================|==============
off       16 |     24 |    46        46 |   33       40
          32 |     33 |    53        51 |   45       51
          48 |     38 |    51        61 |   46       56
          64 |     41 |    56        75 |   47       65
          80 |     47 |    53        77 |   48       71
          96 |     45 |    54        80 |   47       66
         112 |     45 |    52        83 |   44       65
         128 |     43 |    55        81 |   39       48
-------------|--------|-----------------|--------------
on        16 |     26 |    47        47 |   35       42
          32 |     33 |    49        52 |   40       49
          48 |     39 |    52        63 |   43       57
          64 |     42 |    53        72 |   43       66
          80 |     43 |    54        81 |   46       71
          96 |     48 |    58        80 |   49       73
         112 |     51 |    58        78 |   51       76
         128 |     55 |    60        83 |   52       76
"hp" means huge pages, the compared branches are:
- master - current master
- sweep - patches up to 0005, default number of partitions (4)
- sweep-16 - patches up to 0005, 16 partitions
- numa - patches up to 0006, default number of partitions (4)
- numa-16 - patches up to 0006, 16 partitions
Compared to master, the results look like this:
hp   clients | sweep  sweep-16 | numa  numa-16
=============|=================|==============
off       16 |  192%      192% | 138%     167%
          32 |  161%      155% | 136%     155%
          48 |  132%      160% | 121%     145%
          64 |  137%      183% | 115%     159%
          80 |  113%      164% | 102%     151%
          96 |  120%      177% | 104%     146%
         112 |  116%      184% |  98%     144%
         128 |  128%      186% |  90%     110%
-------------|-----------------|--------------
on        16 |  181%      181% | 135%     162%
          32 |  148%      158% | 121%     148%
          48 |  133%      161% | 110%     144%
          64 |  126%      171% | 102%     157%
          80 |  126%      188% | 107%     165%
          96 |  121%      167% | 102%     152%
         112 |  114%      153% | 100%     149%
         128 |  109%      151% |  95%     138%
The attached PDF has more results for runs with somewhat modified
parameters, but overall it's very similar to these numbers.
I think this confirms that most of the benefit really comes from just
partitioning the clock-sweep, and that it's mostly independent of the
NUMA stuff. In fact, the NUMA partitioning is often slower. Some of this
may be due to inefficiencies in the patch (e.g. the division in the
formula calculating the partition index, etc.).
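Just to illustrate the division remark: with equal-sized partitions, the
lookup of a buffer's partition ends up doing something like this on every
call (again just a sketch with invented names, not the actual patch code):

-----------------------------------------------------------------------
/*
 * Illustration only - not the actual patch code. With equal-sized
 * partitions the lookup needs integer divisions, which adds up when
 * it's done for every buffer.
 */
static inline int
buffer_partition_index(int buf_id, int nbuffers, int npartitions)
{
    int     per_part = nbuffers / npartitions;
    int     idx = buf_id / per_part;

    /* the last partition absorbs the remainder */
    return (idx >= npartitions) ? (npartitions - 1) : idx;
}
-----------------------------------------------------------------------

(A power-of-two partition size would turn the division into a shift, for
example.)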
So I think this looks quite promising ...
There are a couple unsolved issues, though. While running the tests I
ran into a bunch of weird failures, of two types:
1) Bad address
-----------------------------------------------------------------------
2025-10-30 15:24:21.195 UTC [2038558] LOG: could not read blocks
114543..114558 in file "base/16384/16588": Bad address
2025-10-30 15:24:21.195 UTC [2038558] STATEMENT: SELECT * FROM t_41
OFFSET 1000000000
2025-10-30 15:24:21.195 UTC [2038523] LOG: could not read blocks
119981..119996 in file "base/16384/16869": Bad address
2025-10-30 15:24:21.195 UTC [2038523] CONTEXT: completing I/O on behalf
of process 2038464
2025-10-30 15:24:21.195 UTC [2038523] STATEMENT: SELECT * FROM t_96
OFFSET 1000000000
2025-10-30 15:24:21.195 UTC [2038492] LOG: could not read blocks
118226..118232 in file "base/16384/16478": Bad address
2025-10-30 15:24:21.195 UTC [2038492] STATEMENT: SELECT * FROM t_19
OFFSET 1000000000
2025-10-30 15:24:21.196 UTC [2038477] LOG: could not read blocks
120515..120517 in file "base/16384/16945": Bad address
2025-10-30 15:24:21.196 UTC [2038477] CONTEXT: completing I/O on behalf
of process 2038545
2025-10-30 15:24:21.196 UTC [2038477] STATEMENT: SELECT * FROM t_111
OFFSET 1000000000
-----------------------------------------------------------------------
2) Operation canceled
-----------------------------------------------------------------------
2025-10-31 10:57:21.742 UTC [2685933] LOG: could not read blocks
159..174 in file "base/16384/16398": Operation canceled
2025-10-31 10:57:21.742 UTC [2685933] STATEMENT: SELECT * FROM t_3
OFFSET 1000000000
2025-10-31 10:57:21.742 UTC [2685933] LOG: could not read blocks
143..158 in file "base/16384/16398": Operation canceled
2025-10-31 10:57:21.742 UTC [2685933] STATEMENT: SELECT * FROM t_3
OFFSET 1000000000
2025-10-31 10:57:21.781 UTC [2685933] ERROR: could not read blocks
143..158 in file "base/16384/16398": Operation canceled
2025-10-31 10:57:21.781 UTC [2685933] STATEMENT: SELECT * FROM t_3
OFFSET 1000000000
-----------------------------------------------------------------------
I'm still not sure what's causing these, and it happens rarely and
randomly, so it's hard to catch and reproduce. I'd welcome suggestions
on what to look for / what might be the issue.
I did run the whole test under valgrind to make sure there's nothing
obviously broken, but that found no issues. Of course, it's much slower
under valgrind, so maybe it just didn't hit the issue.
I suspect the "bad address" might be just a different symptom of the
issues with reserving huge pages I already mentioned [2]. I assume
io_uring might try using huge pages internally, and then it fails
because postgres also reserves huge pages.
I have no idea what "operation canceled" might be about.
I'm not entirely sure if this affects all patches, or just the patches
with NUMA partitioning. Or whether it's tied to huge pages. I'll do more
runs to test this specifically.
But it does seem to be specific to io_uring - or at least the "operation
canceled" issue does. I haven't seen it after switching to "worker".
[1]
https://www.postgresql.org/message-id/51e51832-7f47-412a-a1a6-b972101cc8cb%40vondra.me
[2]
https://www.postgresql.org/message-id/1d57d68d-b178-415a-ba11-be0c3714638e%40vondra.me
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| clocksweep-results.pdf | application/pdf | 60.4 KB |
| v20251101-0007-NUMA-partition-PGPROC.patch | text/x-patch | 49.2 KB |
| v20251101-0006-NUMA-shared-buffers-partitioning.patch | text/x-patch | 43.6 KB |
| v20251101-0005-clock-sweep-weighted-balancing.patch | text/x-patch | 5.2 KB |
| v20251101-0004-clock-sweep-scan-all-partitions.patch | text/x-patch | 6.7 KB |
| v20251101-0003-clock-sweep-balancing-of-allocations.patch | text/x-patch | 25.3 KB |
| v20251101-0002-clock-sweep-basic-partitioning.patch | text/x-patch | 33.9 KB |
| v20251101-0001-Infrastructure-for-partitioning-shared-buf.patch | text/x-patch | 15.0 KB |