From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Greg Burd <greg(at)burd(dot)me>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-18 16:46:44
Message-ID: r4mulzyq7oboqbhddapssyyg7stgteju2qflcks75g6fmfnqmi@leqbxalg4pov
Lists: pgsql-hackers
Hi,
On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote:
> Here's a v2 of the patch series, with a couple changes:
Not a deep look at the code, just a quick reply.
> * I changed the freelist partitioning scheme a little bit, based on the
> discussion in this thread. Instead of having a single "partition" per
> NUMA node, there's not a minimum number of partitions (set to 4). So
I assume s/not/now/?
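Just to check my understanding of the new scheme, I read it as something like
the sketch below (names made up, and the minimum could also be applied per
node rather than in total):

    /* sketch of my reading of the scheme; names are made up */
    #define MIN_BUFFER_PARTITIONS	4

    static int
    NumBufferPartitions(int numa_nodes)
    {
        /*
         * One partition per NUMA node, but never fewer than the minimum, so
         * even a machine with one or two nodes ends up with several
         * freelist/clocksweep partitions.
         */
        return Max(numa_nodes, MIN_BUFFER_PARTITIONS);
    }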
> * There's now a patch partitioning clocksweep, using the same scheme as
> the freelists.
Nice!
> I came to the conclusion it doesn't make much sense to partition these
> things differently - I can't think of a reason why that would be
> advantageous, and it makes it easier to reason about.
Agreed.
> The clocksweep partitioning is somewhat harder, because it affects
> BgBufferSync() and related code. With the partitioning we now have
> multiple "clock hands" for different ranges of buffers, and the clock
> sweep needs to consider that. I modified BgBufferSync to simply loop
> through the ClockSweep partitions, and do a small cleanup for each.
That probably makes sense for now. It might need a bit of a larger adjustment
at some point, but ...
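I.e. I'm assuming the shape is roughly the following (the per-partition helper
and the partition count are invented names, not the actual patch code):

    /* hypothetical shape of the partitioned bgwriter pass */
    bool
    BgBufferSync(WritebackContext *wb_context)
    {
        bool    all_idle = true;

        /* do a small cleanup pass for each clock-sweep partition */
        for (int i = 0; i < NumClockSweepPartitions; i++)
        {
            /* returns true if this partition had nothing left to clean */
            if (!BgBufferSyncPartition(i, wb_context))
                all_idle = false;
        }

        /* only allow bgwriter hibernation if every partition was idle */
        return all_idle;
    }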
> * This new freelist/clocksweep partitioning scheme is however harder to
> disable. I now realize the GUC may not quite do the trick, and there isn't
> even a GUC for the clocksweep. I need to think about this, but I'm not even
> sure how feasible it'd be to have two separate GUCs (because of how
> these two pieces are intertwined). For now if you want to test without
> the partitioning, you need to skip the patch.
I think it's totally fair to enable/disable them at the same time. They're so
closely related that I don't think it really makes sense to measure them
separately.
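FWIW, if it ends up being a single knob, it'd just be one bool GUC covering
both, something like the guc_tables.c entry below (the GUC name and variable
are made up):

    /* hypothetical guc_tables.c entry; name and variable are made up */
    {
        {"numa_partition_buffers", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Partitions shared buffers, freelists and the clock sweep across NUMA nodes."),
            NULL
        },
        &numa_partition_buffers,
        false,
        NULL, NULL, NULL
    },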
> I did some quick perf testing on my old xeon machine (2 NUMA nodes), and
> the results are encouraging. For a read-only pgbench (2x shared buffers,
> within RAM), I saw an increase from 1.1M tps to 1.3M. Not crazy, but not
> bad considering the patch is more about consistency than raw throughput.
Personally I think a 1.18x improvement on a relatively small NUMA machine is
really rather awesome.
> For a read-write pgbench I however saw some strange drops/increases of
> throughput. I suspect this might be due to some thinko in the clocksweep
> partitioning, but I'll need to take a closer look.
Was that with pinning etc enabled or not?
> From c4d51ab87b92f9900e37d42cf74980e87b648a56 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas(at)vondra(dot)me>
> Date: Sun, 8 Jun 2025 18:53:12 +0200
> Subject: [PATCH v2 5/7] NUMA: clockweep partitioning
>
> @@ -475,13 +525,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
> /*
> * Nothing on the freelist, so run the "clock sweep" algorithm
> *
> - * XXX Should we also make this NUMA-aware, to only access buffers from
> - * the same NUMA node? That'd probably mean we need to make the clock
> - * sweep NUMA-aware, perhaps by having multiple clock sweeps, each for a
> - * subset of buffers. But that also means each process could "sweep" only
> - * a fraction of buffers, even if the other buffers are better candidates
> - * for eviction. Would that also mean we'd have multiple bgwriters, one
> - * for each node, or would one bgwriter handle all of that?
> + * XXX Note that ClockSweepTick() is NUMA-aware, i.e. it only looks at
> + * buffers from a single partition, aligned with the NUMA node. That
> + * means it only accesses buffers from the same NUMA node.
> + *
> + * XXX That also means each process "sweeps" only a fraction of buffers,
> + * even if the other buffers are better candidates for eviction. Maybe
> + * there should be some logic to "steal" buffers from other freelists
> + * or other nodes?
I think we *definitely* need "stealing" from other clock sweeps, whenever
there's a meaningful imbalance between the different sweeps.
I don't think we need to be overly precise about it; a small imbalance won't
have that much of an effect. But clearly it doesn't make sense to say that a
backend can only fill buffers in its current partition; that'd lead to massive
performance issues in a lot of workloads.
The hardest thing probably is to make the logic for when to check foreign
clock sweeps cheap enough.
One way would be to do it whenever a sweep wraps around; that'd probably
amortize the cost sufficiently, and I don't think it'd be too imprecise, as
we'd have processed that set of buffers in a row without partitioning as
well. It'd probably be too coarse for deciding how long to keep using a
foreign sweep instance, but we could address that by rechecking the
balance more frequently while using a foreign partition.
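Something like the sketch below, maybe (all names invented, just to illustrate
the wraparound-triggered check):

    /*
     * Sketch only: pick which clock-sweep partition a backend should use,
     * re-evaluating the balance whenever its home sweep completes a pass.
     */
    static int
    ChooseSweepPartition(int home)
    {
        uint32  home_passes = ClockSweeps[home].complete_passes;
        int     target = home;
        uint32  min_passes = home_passes;

        /* only pay the cost of looking around when the home sweep wrapped */
        if (!ClockSweeps[home].just_wrapped)
            return home;

        for (int i = 0; i < NumClockSweepPartitions; i++)
        {
            uint32  passes = ClockSweeps[i].complete_passes;

            /* switch only to a sweep that's meaningfully behind our own */
            if (passes + SWEEP_IMBALANCE_SLACK < home_passes &&
                passes < min_passes)
            {
                target = i;
                min_passes = passes;
            }
        }

        return target;
    }

While target != home, the same check could simply be repeated every N ticks
instead of only on wraparound.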
Another way would be to have bgwriter manage this. Whenever it detects that
one ring is too far ahead, it could set an "avoid this partition" bit, which
would trigger backends that natively use that partition to switch to foreign
partitions that don't currently have that bit set. I suspect there's a
problem with that approach though: I worry that the amount of time that
bgwriter spends in BgBufferSync() may sometimes be too long, leading to too
much imbalance.
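For concreteness, the bgwriter side of that variant might look like the sketch
below (flag name and struct placement invented); the staleness concern above
is really about how often this gets to run:

    /*
     * Sketch: mark partitions whose sweep is too far ahead of the slowest
     * one, so backends with a marked home partition go elsewhere.
     */
    static void
    UpdatePartitionAvoidBits(void)
    {
        uint32  min_passes = PG_UINT32_MAX;

        for (int i = 0; i < NumClockSweepPartitions; i++)
            min_passes = Min(min_passes, ClockSweeps[i].complete_passes);

        for (int i = 0; i < NumClockSweepPartitions; i++)
        {
            bool    avoid = ClockSweeps[i].complete_passes >
                min_passes + SWEEP_IMBALANCE_SLACK;

            pg_atomic_write_u32(&ClockSweeps[i].avoid_partition,
                                avoid ? 1 : 0);
        }
    }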
Greetings,
Andres Freund