Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-06-24 20:26:29
Message-ID: c3954e28-0c38-4df8-b76e-ec09f6c04021@vondra.me
Lists: pgsql-hackers

Hi,

Here's an updated patch series, with only minor changes to fix the mbind
issues:

1) It uses the correct nodemask size, so that the mbind actually binds
the partition to the right node.

2) It aligns the start/end pointers so that no pages are left with
the default memory policy. So now there should be only the bind:N
entries, not the single-page "default" ones. This means the last page of
a partition can be mapped to a different node, but that seems fine (in
the end it could have happened with the old approach too).
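
To make the alignment in (2) concrete, here is a minimal sketch in Python (not the C code from the patch). It assumes 4 kB pages, a page-aligned start of the region, and that interior partition boundaries are rounded down; the patch may use a different rounding rule. The helper names are hypothetical. The point is only that adjacent ranges share their boundary, so no page falls between two mbind calls, and that the page straddling two partitions ends up with the later one.

```python
PAGE_SIZE = 4096  # assumption: regular 4 kB pages; huge pages would be larger


def align_down(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round addr down to the nearest page boundary."""
    return addr - (addr % page_size)


def align_up(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round addr up to the nearest page boundary."""
    return align_down(addr + page_size - 1, page_size)


def partition_ranges(base: int, total_size: int, num_partitions: int,
                     page_size: int = PAGE_SIZE) -> list[tuple[int, int]]:
    """Return page-aligned (start, end) ranges that tile the whole region.

    Hypothetical helper, assuming base is already page-aligned. Interior
    boundaries are rounded down, so the page straddling two partitions is
    bound together with the later partition.
    """
    part_size = total_size // num_partitions
    # Unaligned boundaries between partitions, including both outer ends.
    bounds = [base + i * part_size for i in range(num_partitions)]
    bounds.append(base + total_size)

    aligned = [align_down(b, page_size) for b in bounds[:-1]]
    aligned.append(align_up(bounds[-1], page_size))  # cover the tail page
    return list(zip(aligned[:-1], aligned[1:]))


if __name__ == "__main__":
    for part, (start, end) in enumerate(partition_ranges(0, 30000, 3)):
        pages = (end - start) // PAGE_SIZE
        print(f"partition {part}: [{start}, {end}) -> {pages} pages")
```

Each (start, end) pair would then be the address and length passed to a single bind call for that partition's node.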

I've also included Jakub's "goodies" patch with the additional GUCs.
Those seem potentially useful to development.

I have some results from a new round of benchmarks, and it's a bit
disappointing. Or rather, there seem to be some issues that I can't
figure out, causing regressions.

Consider a very simple test, doing a lot of sequential scans to put a
fair amount of pressure on the clocksweep / buffer replacement. There's
a .tgz with the benchmark script attached, but it does about this:

* Initializes a pgbench database with scale 2000 (so ~30GB, about twice
the shared buffers).

* Uses --partitions=100, so that the partitions are small enough not to
trigger the 1/4 threshold (i.e. not use circular buffers).

* Does runs with a custom script, forcing sequential scans of the table,
with two queries:

select count(1) from pgbench_accounts;

select * from pgbench_accounts offset 1000000000;

Those are called "count" and "offset" in the results. The script forces
serial sequential scans (no index scans, no parallelism), and does runs
with 1, 8 and 32 clients (this is an old-ish xeon with 44 physical cores
on two sockets, 2 NUMA nodes).

I did runs with "master" and all 7 patches, with the NUMA stuff
enabled/disabled since 0003 (which adds it). See the two PDFs with more
complete results, but here's the "count" query for a subset of the
patches (the omitted ones behave similarly to what's shown here).

This chart is for median latency (in milliseconds):

clients   master     0003     0004   0003/on   0004/on
------------------------------------------------------
      1    12767    12582    14509     12807     15307
      8    14383    14355    14149     14069     16165
     32    14756    15198    14836     14984     17128
------------------------------------------------------
      1              103%     114%      100%      120%
      8              101%      98%       98%      112%
     32              102%     101%      102%      116%

The percentages are compared to "master", the columns with "/on" are
with shared_buffers_numa=on.
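
As a quick check of how the percentages are derived, here is a small Python sketch that recomputes them for the two "/on" columns from the median latencies in the table above. The function and variable names are made up for illustration; the only assumption is that the percentage is the latency divided by the "master" latency for the same client count, rounded to a whole percent.

```python
# Median latencies (ms) copied from the table above, keyed by client count.
master = {1: 12767, 8: 14383, 32: 14756}
numa_on = {
    "0003/on": {1: 12807, 8: 14069, 32: 14984},
    "0004/on": {1: 15307, 8: 16165, 32: 17128},
}


def relative_to_master(latency_ms: int, baseline_ms: int) -> int:
    """Latency as a percentage of the baseline, rounded to a whole percent."""
    return round(100 * latency_ms / baseline_ms)


def relative_table(results: dict, baseline: dict) -> dict:
    """Map each column name to {clients: percent of baseline}."""
    return {
        name: {clients: relative_to_master(ms, baseline[clients])
               for clients, ms in column.items()}
        for name, column in results.items()
    }


if __name__ == "__main__":
    for name, column in relative_table(numa_on, master).items():
        cells = "  ".join(f"{clients}: {pct}%" for clients, pct in column.items())
        print(f"{name:8s} {cells}")
```

Values above 100% mean higher latency than master, i.e. a regression.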

Clearly, there's no change with 0003 (which binds shared buffer
partitions to NUMA nodes), even if that's enabled. The differences are
within noise, pretty much, for all client counts.

Then 0004 gets applied, which partitions the clock sweep. And well, that
doesn't go particularly well. There is a bit of a regression even with
numa=off, but it kinda recovers with the following patches. But with
numa=on, there's a consistent ~10% regression (give or take).
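
For readers not familiar with the idea, here is a toy Python model of what "partitioning the clock sweep" means in general. This is not the code from 0004: the class and function names are hypothetical, and it ignores pins, locking, atomics and the balancing added by later patches. It only shows that each partition has its own hand and only ever evicts from its own slice of buffers.

```python
MAX_USAGE_COUNT = 5  # cap on the per-buffer usage counter in this toy model


class ClockSweepPartition:
    """One partition: a contiguous slice of buffers with its own clock hand."""

    def __init__(self, first_buffer: int, num_buffers: int):
        self.first_buffer = first_buffer
        self.usage = [0] * num_buffers  # usage count per buffer in the slice
        self.hand = 0                   # next position the sweep will look at

    def touch(self, buffer_id: int) -> None:
        """Record an access to a buffer owned by this partition."""
        idx = buffer_id - self.first_buffer
        self.usage[idx] = min(self.usage[idx] + 1, MAX_USAGE_COUNT)

    def pick_victim(self) -> int:
        """Advance the hand, decrementing counts, until a zero-count buffer."""
        while True:
            idx = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            if self.usage[idx] == 0:
                return self.first_buffer + idx
            self.usage[idx] -= 1


def make_partitions(num_buffers: int, num_partitions: int) -> list:
    """Split the buffer pool into equally sized partitions."""
    size = num_buffers // num_partitions
    return [ClockSweepPartition(i * size, size) for i in range(num_partitions)]


if __name__ == "__main__":
    parts = make_partitions(8, 2)
    parts[1].touch(4)
    parts[1].touch(6)
    parts[1].touch(6)
    print([parts[1].pick_victim() for _ in range(3)])
```

In this model a backend would call pick_victim() on the partition for its own NUMA node, so sweeps on different partitions never contend on a shared hand, but they also never see free buffers in other partitions.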

I've spent a fair bit of time investigating what's causing this, but so
far I have nothing. I assume it's something silly in the patches
partitioning the clocksweep, or maybe the approach is flawed in some
way. Not sure :-(

regards

--
Tomas Vondra

Attachment                                                        Content-Type                Size
v20260624-0001-Add-shmem_populate-and-shmem_interleave-GU.patch   text/x-patch                4.9 KB
v20260624-0002-Infrastructure-for-partitioning-of-shared-.patch   text/x-patch                14.3 KB
v20260624-0003-NUMA-shared-buffers-partitioning.patch             text/x-patch                26.8 KB
v20260624-0004-clock-sweep-basic-partitioning.patch               text/x-patch                34.0 KB
v20260624-0005-clock-sweep-balancing-of-allocations.patch         text/x-patch                27.4 KB
v20260624-0006-clock-sweep-scan-all-partitions.patch              text/x-patch                6.2 KB
v20260624-0007-Add-parttioned-clocksweep-and-NUMA-goodies.patch   text/x-patch                14.6 KB
numa-scripts.tgz                                                  application/x-compressed-tar  1.2 KB
numa- median latency.pdf                                          application/pdf             37.5 KB
numa - tps.pdf                                                    application/pdf             38.6 KB
