| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-01-12 23:58:49 |
| Message-ID: | 0e1b997d-99c8-40f4-bc32-6c044bc7ed9a@vondra.me |
| Lists: | pgsql-hackers |
On 1/10/26 02:42, Andres Freund wrote:
> Hi,
>
> On 2025-12-08 21:02:27 +0100, Tomas Vondra wrote:
>> * Most of the benefit comes from patches unrelated to NUMA. The initial
>> patches partition clockweep, in a NUMA oblivious way. In fact, applying
>> the NUMA patches often *reduces* the throughput. So if we're concerned
>> about contention on the clocksweep hand, we could apply just these first
>> patches. That way we wouldn't have to deal with huge pages.
>
>> * Furthermore, I'm not quite sure clocksweep really is a bottleneck in
>> realistic cases. The benchmark used in this thread does many concurrent
>> sequential scans, on data that exceeds shared buffers / fits into RAM.
>> Perhaps that happens, but I doubt it's all that common.
>
> I think this misses that this isn't necessarily about peak throughput under
> concurrent contention. Consider this scenario:
>
> 1) shared buffers is already allocated from a kernel POV, i.e. pages reside on
> some numa node instead of having to be allocated on the first access
>
> 2) one backend does a scan of scan of a relation [largely] not in shared
> buffers
>
> Whether the buffers for the ringbuffer (if the relation is > NBuffers/4) or
> for the entire relation (if smaller) is allocated on the same node as the
> backend makes a quite substantial difference. I see about a 25% difference
> even on a small-ish numa system.
>
> Partitioned clocksweep makes it vastly more likely that data is on the local
> numa node.
>
> If you simulate different locality modes with numactl, I can see pretty
> drastic differences for the processing of individual queries, both with
> parallel and non-parallel processing.
>
>
> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
>
> membind 0, cpunodebind 1, max_parallel_workers_per_gather=0:
> S0 6 341,635,792 UNC_M_CAS_COUNT.WR # 4276.9 MB/s memory_bandwidth_write
> S0 20 5,116,381,542 duration_time
> S0 6 255,977,795 UNC_M_CAS_COUNT.RD # 3204.6 MB/s memory_bandwidth_read
> S0 20 5,116,391,355 duration_time
> S1 6 2,418,579 UNC_M_CAS_COUNT.WR # 30.3 MB/s memory_bandwidth_write
> S1 6 115,511,123 UNC_M_CAS_COUNT.RD # 1446.1 MB/s memory_bandwidth_read
>
> 5.112286670 seconds time elapsed
>
>
> membind 1, cpunodebind 1, max_parallel_workers_per_gather=0:
> S0 6 16,528,154 UNC_M_CAS_COUNT.WR # 248.1 MB/s memory_bandwidth_write
> S0 20 4,267,078,201 duration_time
> S0 6 40,327,670 UNC_M_CAS_COUNT.RD # 605.4 MB/s memory_bandwidth_read
> S0 20 4,267,088,762 duration_time
> S1 6 116,925,559 UNC_M_CAS_COUNT.WR # 1755.2 MB/s memory_bandwidth_write
> S1 6 244,251,242 UNC_M_CAS_COUNT.RD # 3666.5 MB/s memory_bandwidth_read
>
> 4.263442844 seconds time elapsed
>
>
> interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=0:
>
> S0 6 196,713,044 UNC_M_CAS_COUNT.WR # 2757.4 MB/s memory_bandwidth_write
> S0 20 4,569,805,767 duration_time
> S0 6 167,497,804 UNC_M_CAS_COUNT.RD # 2347.9 MB/s memory_bandwidth_read
> S0 20 4,569,816,439 duration_time
> S1 6 81,992,696 UNC_M_CAS_COUNT.WR # 1149.3 MB/s memory_bandwidth_write
> S1 6 192,265,269 UNC_M_CAS_COUNT.RD # 2695.1 MB/s memory_bandwidth_read
>
> 4.565722468 seconds time elapsed
>
>
> membind 0, cpunodebind 1, max_parallel_workers_per_gather=8:
> S0 6 336,538,518 UNC_M_CAS_COUNT.WR # 24130.2 MB/s memory_bandwidth_write
> S0 20 895,976,459 duration_time
> S0 6 238,663,716 UNC_M_CAS_COUNT.RD # 17112.4 MB/s memory_bandwidth_read
> S0 20 895,986,193 duration_time
> S1 6 2,594,371 UNC_M_CAS_COUNT.WR # 186.0 MB/s memory_bandwidth_write
> S1 6 113,981,673 UNC_M_CAS_COUNT.RD # 8172.6 MB/s memory_bandwidth_read
>
> 0.892594989 seconds time elapsed
>
>
> membind 1, cpunodebind 1, max_parallel_workers_per_gather=8:
> S0 6 3,492,673 UNC_M_CAS_COUNT.WR # 322.0 MB/s memory_bandwidth_write
> S0 20 698,175,650 duration_time
> S0 6 5,363,152 UNC_M_CAS_COUNT.RD # 494.4 MB/s memory_bandwidth_read
> S0 20 698,187,522 duration_time
> S1 6 117,181,190 UNC_M_CAS_COUNT.WR # 10802.4 MB/s memory_bandwidth_write
> S1 6 251,059,297 UNC_M_CAS_COUNT.RD # 23144.0 MB/s memory_bandwidth_read
>
> 0.694253637 seconds time elapsed
>
>
> interleave 0,1, cpunodebind 1, max_parallel_workers_per_gather=8:
>
> S0 6 170,352,086 UNC_M_CAS_COUNT.WR # 13767.3 MB/s memory_bandwidth_write
> S0 20 797,166,139 duration_time
> S0 6 121,646,666 UNC_M_CAS_COUNT.RD # 9831.1 MB/s memory_bandwidth_read
> S0 20 797,175,899 duration_time
> S1 6 60,099,863 UNC_M_CAS_COUNT.WR # 4857.1 MB/s memory_bandwidth_write
> S1 6 182,035,468 UNC_M_CAS_COUNT.RD # 14711.5 MB/s memory_bandwidth_read
>
> 0.791915733 seconds time elapsed
>
>
>
> You're never going to be quite as good when actually using both NUMA nodes,
> but at least simple workloads like the above should be able to get a lot
> closer to the good number from above than we currently are.
>
I see no such improvements, unfortunately. Even when I explicitly pin
memory and cpus to different nodes using numactl. Consider a simple
experiment, starting an instance either like this:
numactl --membind=0 --cpunodebind=0 pg_ctl -D /mnt/data/data-numa start
or like this
numactl --membind=0 --cpunodebind=1 pg_ctl -D /mnt/data/data-numa start
on a 2-node NUMA cluster. To the best of my knowledge this means that
either both the memory and all pg processes (including the backend) are
on node 0, or the memory is on node 0 and the backend is on node 1.
And then I initialized pgbench with a scale that is much larger than
shared buffers, but fits into RAM. So cached, but definitely > NBuffers/4.
And then I ran
select * from pgbench_accounts offset 1000000000;
which does a sequential scan with the circular buffer you mention above.
I've taken all reasonable precautions to stabilize the results, like
enabling huge pages (both for shared memory and binaries), disabling
checksums, ... And I ran that on an Azure instance D96v6 with EPYC 9V74.
This was with scale 10000 (~150GB), shared_buffers=8GB.
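For completeness, the experiment boils down to roughly the following
sequence of commands (paths and settings here are illustrative, not a
verbatim copy of my script):

  # illustrative config: 8GB shared buffers, huge pages enabled
  echo "shared_buffers = '8GB'" >> /mnt/data/data-numa/postgresql.conf
  echo "huge_pages = on" >> /mnt/data/data-numa/postgresql.conf

  # start with memory/CPUs on the same node, or on different nodes
  numactl --membind=0 --cpunodebind=0 pg_ctl -D /mnt/data/data-numa start
  #numactl --membind=0 --cpunodebind=1 pg_ctl -D /mnt/data/data-numa start

  # ~150GB of data, much larger than shared buffers, but cached by the OS
  pgbench -i -s 10000

  # repeat a couple of times, measuring with \timing
  psql
    \timing on
    select * from pgbench_accounts offset 1000000000;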
And I get this:
worker / 32
-----------
numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 26280.437 ms (00:26.280)
Time: 26177.165 ms (00:26.177)
Time: 26182.222 ms (00:26.182)
Time: 26174.421 ms (00:26.174)
Time: 26216.989 ms (00:26.217)
numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 26412.878 ms (00:26.413)
Time: 26413.332 ms (00:26.413)
Time: 26202.899 ms (00:26.203)
Time: 26412.627 ms (00:26.413)
Time: 26484.962 ms (00:26.485)
io_uring
--------
numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 26286.977 ms (00:26.287)
Time: 26499.830 ms (00:26.500)
Time: 26629.990 ms (00:26.630)
Time: 26443.147 ms (00:26.443)
numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 26727.655 ms (00:26.728)
Time: 26787.456 ms (00:26.787)
Time: 26484.260 ms (00:26.484)
Time: 26250.737 ms (00:26.251)
Time: 26208.913 ms (00:26.209)
I don't see any difference. To rule out any virtualization weirdness, I
did the same experiment on my old Xeon machine (also 2-node NUMA), just
with a smaller scale (2000) and shared_buffers=4GB. And that gave me:
xeon scale=2000 nochecksums
worker / 32
-----------
numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 5519.728 ms (00:05.520)
Time: 5570.215 ms (00:05.570)
Time: 5568.233 ms (00:05.568)
Time: 5556.465 ms (00:05.556)
Time: 5517.420 ms (00:05.517)
numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 5639.281 ms (00:05.639)
Time: 5657.822 ms (00:05.658)
Time: 5653.077 ms (00:05.653)
Time: 5647.780 ms (00:05.648)
Time: 5647.288 ms (00:05.647)
io_uring
--------
numactl --membind=0 --cpunodebind=0 pg_ctl ...
Time: 7517.920 ms (00:07.518)
Time: 7180.628 ms (00:07.181)
Time: 7162.801 ms (00:07.163)
Time: 7164.827 ms (00:07.165)
Time: 7177.757 ms (00:07.178)
numactl --membind=0 --cpunodebind=1 pg_ctl ...
Time: 7622.372 ms (00:07.622)
Time: 7571.923 ms (00:07.572)
Time: 7571.966 ms (00:07.572)
Time: 7568.269 ms (00:07.568)
Time: 7558.195 ms (00:07.558)
If I squint a little bit, there's a difference for io_uring. But it's not
even 5%, definitely not 25%.
>
>
> Maybe the problem is that the patchset doesn't actually quite work right now?
> I checked out numa-20251111 and ran a query for a 1GB table in a 40GB s_b
> system: there's not much more locality with debug_numa=buffers, than without
> (roughly 55% on one node, 45% on the other). Making it not surprising that the
> results aren't great.
>
Hard to say, but I'd guess that's because of the clocksweep balancing,
which ensures that we don't overload a single NUMA node. Imagine an
instance with a single connection - it can't allocate only from its local
NUMA node, because that'd mean it only ever uses 50% of the available
cache, which does not seem great. Maybe there's a better way to address this.
>
>
>> I've been unable to demonstrate any benefits on other workloads, even if
>> there's a lot of buffer misses / reads into shared buffers. As soon as
>> the query starts doing something else, the clocksweep contention becomes
>> a non-issue. Consider for example read-only pgbench with database much
>> larger than shared buffers (but still within RAM). The cost of the index
>> scans (and other nodes) seems to reduce the pressure on clocksweep.
>>
>> So I'm skeptical about clocksweep pressure being a serious issue, except
>> for some very narrow benchmarks (like the concurrent seqscan test). And
>> even if this happened for some realistic cases, partitioning the buffers
>> in a NUMA-oblivious way seems to do the trick.
>
> I think you're over-indexing on the contention aspect and under-indexing on
> the locality benefits.
>
I've been unable to demonstrate meaningful benefits of locality (like in
the example above), while I've been able to show benefits of reducing
the clocksweep contention. It's entirely possible I'm doing it wrong or
missing something, of course.
>
>> When discussing this stuff off list, it was suggested this might help
>> with the scenario Andres presented in [3], where the throughput improves
>> a lot with multiple databases. I've not observed that in practice, and I
>> don't think these patches really can help with that. That scenario is
>> about buffer lock contention, not clocksweep contention.
>
> Buffer content and buffer headers being on your local node makes access
> faster...
>
That was my expectation too, but I haven't seen meaningful improvements
in any benchmark.
For example, in the benchmark I presented earlier, all the memory is on
node 0 (so both headers and buffers). And there does not seem to be any
measurable difference when accessing it from node 0 vs. node 1. So why
would it matter that the header may be on node 0 and the buffer on node 1?
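For what it's worth, this is how I verify where the shared memory
actually ended up - the view from your earlier command, plus numastat on
the postmaster (treat this as a sketch, the data directory is the one
from the experiment above):

  # per-node breakdown of the shared memory segments
  psql -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;'

  # and the OS view of the postmaster's memory
  numastat -p $(head -n 1 /mnt/data/data-numa/postmaster.pid)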
>
>> Attached is a tiny patch doing mostly what Jakub did, except that it
>> does two things. First, it allows interleaving the shared memory on all
>> relevant NUMA nodes (per numa_get_mems_allowed). Second, it allows
>> populating all memory by setting MAP_POPULATE in mmap(). There's a new
>> GUC to enable each of these.
>
>> I think we should try this (much simpler) approach first, or something
>> close to it. Sorry for dragging everyone into a much more complex
>> approach, which now seems to be a dead end.
>
> I'm somewhat doubtful that interleaving is going to be good enough without
> some awareness of which buffers to preferrably use. Additionally, without huge
> pages, there are significant negative performance effects due to each buffer
> being split across two numa nodes.
>
I'm rather skeptical about this being worth it without huge pages. If
you're trying to get the best performance on a NUMA machine (which is
likely big, with a lot of RAM), then huge pages are a huge improvement on
their own. I'd even say this NUMA stuff might/should require huge_pages=on.
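(To be concrete, something along these lines; the page count below is
illustrative and has to cover all of shared memory, not just
shared_buffers:)

  # reserve enough 2MB huge pages for the whole shared memory segment
  # (postgres -C shared_memory_size_in_huge_pages prints the exact count)
  sysctl -w vm.nr_hugepages=4500

  # and then huge_pages = on in postgresql.conf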
--
Tomas Vondra