| From: | "Greg Burd" <greg(at)burd(dot)me> |
|---|---|
| To: | "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Cc: | "Andres Freund" <andres(at)anarazel(dot)de>, "Tomas Vondra" <tomas(at)vondra(dot)me>, "Nathan Bossart" <nathandbossart(at)gmail(dot)com> |
| Subject: | Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention |
| Date: | 2026-04-27 13:26:12 |
| Message-ID: | 3d85395b-698e-4f0e-873e-17b3f277f7c7@app.fastmail.com |
| Lists: | pgsql-hackers |
On Sat, Apr 25, 2026, at 4:08 PM, Greg Burd wrote:
> Hello hackers,
Hi again, attached is v2:
0001 - unchanged; batches the clock sweep to reduce contention
0002 - changes ComputeClockBatchSize() so that non-NUMA multi-core systems also use batching rather than defaulting to batch size 1
Details below...
> A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior
> on some of the newer AWS bare-metal instance types (r8i in particular,
> which exposes 6 NUMA nodes via SNC3 on a 2-socket box), and in the
> process landed on a very small change to freelist.c that I think is
> worth showing around. His patch is attached with some tweaks of my own.
>
> Full disclosure: the exploration that led Jim to this patch idea was
> done with help from an AI assistant (Kiro); the idea, the benchmarking,
> and the final shape of the patch are human-driven, but I wanted to be
> up front about how his investigation started. Happy to discuss that
> separately if people want to.
>
> The one-line summary: instead of advancing nextVictimBuffer one buffer
> at a time via pg_atomic_fetch_add_u32, each backend claims a batch of
> 64 consecutive buffer IDs from the shared hand and then iterates them
> privately. Global sweep order is preserved -- every buffer is still
> visited exactly once per complete pass -- but the atomic contention on
> that one cache line drops by roughly the batch size.
>
>
> Why this matters
> ----------------
>
> On multi-socket boxes under eviction pressure, every backend that needs
> a victim buffer ends up CAS'ing the same cache line. On a single
> socket, a locked RMW on that cache line stays warm in L1/L2 and
> completes in ~20ns. On 2+ sockets, the line bounces over QPI/UPI at
> ~100-200ns per op, and with hundreds of backends running
> StrategyGetBuffer() concurrently, the line ping-pongs constantly. It's
> a textbook NUMA scalability bottleneck, and once shared_buffers is
> smaller than the working set and the sweep is running continuously,
> that single atomic is what you hit in a perf profile (elevated
> bus-cycles, cache-misses on the cache line holding nextVictimBuffer).
>
> Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas
> called it out in the "Adding basic NUMA awareness" thread [1] -- so
> this isn't news to anyone who's been looking at this area. What I
> think is new is a fix that's just this, without any of the surrounding
> architectural change.
>
> The framing (credit to Jim): the clock hand is doing two jobs. It
> *coordinates* backends so they don't redundantly decrement usage_count
> on the same buffers and so they eventually visit every buffer in the
> pool exactly once per pass. It also *serializes* access to the
> counter. Coordination is the part we want. Serialization is the part
> that's killing us on bigger NUMA boxes. Batching keeps the
> coordination and thins out the serialization.
>
>
> How it works
> ------------
>
> Two per-backend statics, MyBatchPos and MyBatchEnd. When a backend
> calls ClockSweepTick() and its local batch is exhausted, it does a
> single fetch-add of CLOCK_SWEEP_BATCH_SIZE (64) against
> nextVictimBuffer and now owns that range. Subsequent ticks just bump
> the local counter.
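To make that concrete for reviewers, the batched tick has roughly this shape (a sketch of what's described above, not the literal patch; ClockSweepBatchSize is the gated size discussed below, and wraparound of nextVictimBuffer is handled separately, see the next paragraph):

    static uint32 MyBatchPos = 0;   /* next buffer id in our claimed range */
    static uint32 MyBatchEnd = 0;   /* one past the last id we own */

    static inline uint32
    ClockSweepTick(void)
    {
        if (MyBatchPos >= MyBatchEnd)
        {
            /* one shared atomic claims an entire batch of buffer ids */
            MyBatchPos =
                pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer,
                                        ClockSweepBatchSize);
            MyBatchEnd = MyBatchPos + ClockSweepBatchSize;
        }

        /* every subsequent tick within the batch is purely backend-local */
        return MyBatchPos++ % NBuffers;
    }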
>
> Wraparound got a small rewrite. The original code had the backend that
> crossed NBuffers drive completePasses++ under the spinlock via a CAS
> loop. With batching, multiple backends can each land a fetch-add that
> returns a value >= NBuffers in the same pass, so the logic now is:
> whoever sees a start >= NBuffers takes the spinlock, re-reads the
> counter, and if it's still out of range does a single CAS to wrap it
> and bumps completePasses. If somebody else already wrapped, we just
> release and move on. StrategySyncStart() still sees a consistent
> (nextVictimBuffer, completePasses) pair.
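The wrap path, under the same caveat (a sketch against the names in freelist.c; `start` is the value the batch-claiming fetch-add returned):

    if (start >= (uint32) NBuffers)
    {
        uint32 expected;

        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

        expected = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
        if (expected >= (uint32) NBuffers)
        {
            /*
             * CAS rather than a plain write: other backends keep doing
             * fetch-adds without holding the spinlock, so we must not
             * clobber a claim that lands between our read and our write.
             * If the CAS fails, the backend whose fetch-add beat us also
             * saw start >= NBuffers and will take its turn at the wrap.
             */
            if (pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                               &expected,
                                               expected % NBuffers))
                StrategyControl->completePasses++;
        }
        /* else: somebody already wrapped; nothing to do */

        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
    }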
>
> The batch size is gated on whether we actually have multiple NUMA
> nodes. On a single-socket box the atomic is already socket-local,
> batching just makes backends skip further ahead than they need to, so
> we fall back to batch size 1 -- which is bit-for-bit the original
> behavior. The guard:
>
>     if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
>         ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
>     else
>         ClockSweepBatchSize = 1;
>
> Min() against NBuffers covers the small-shared_buffers corner so a
> batch never wraps the pool multiple times in one claim.
Thinking more about this approach led me to believe that the non-NUMA default of batch size 1 is wrong: it leaves the atomic contention in place on a very common class of systems.
> Does batching mess up the meaning of usage_count?
> --------------------------------------------------
>
> Short answer: no. I want to walk through this because it was my first
> concern too, and I think it's the question that will come up most on
> review.
>
> The clock sweep's usage_count is an access-frequency approximation
> measured in units of *complete passes*. A buffer with usage_count = N
> survives N passes without a re-pin. The semantic meaning lives at pass
> granularity, not at individual-buffer granularity.
>
> What batching changes: intra-pass temporal ordering. Without batching,
> with N backends sweeping, decrements are interleaved -- backend A hits
> B[0], backend B hits B[1], backend C hits B[2]. With batching, backend
> A hits B[0..63] in a tight local burst, then backend B hits B[64..127],
> etc. The 64-buffer chunks are decremented in bursts rather than
> individually.
>
> Why it doesn't matter:
>
> 1. Every buffer still gets decremented exactly once per complete
> pass. The invariant the algorithm actually depends on is
> untouched.
>
> 2. A buffer's survival window is the time between consecutive
> passes. That's milliseconds to seconds under load. Whether
> B[0] gets decremented 50us before or 50us after B[63] within
> the same pass is below the resolution of anything usage_count
> is trying to measure.
>
> 3. The bgwriter's feedback loop reads (nextVictimBuffer,
> completePasses, numBufferAllocs) via StrategySyncStart() every
> ~200ms. nextVictimBuffer still advances at the same *total*
> rate (64 per atomic op, but atomic ops happen 1/64 as often).
> The position it reports can jitter by up to 64 buffers relative
> to the one-at-a-time case, but BgBufferSync()'s smoothed
> estimates operate over thousands of buffers per cycle, so the
> jitter disappears into the averaging. numBufferAllocs still
> increments once per allocation. strategy_delta,
> smoothed_alloc, smoothed_density, reusable_buffers_est -- all
> unaffected in any way I can see.
>
> Table form, because it's easier to argue with:
>
> Property | Unpatched | Batched
> ----------------------------------+----------------+----------------
> Buffers visited per pass | NBuffers | NBuffers
> Decrements per buffer per pass | 1 | 1
> Eviction threshold | usage_count==0 | usage_count==0
> Max survival (passes) | 6 | 6
> Decrement ordering within a pass | interleaved | chunked
> bgwriter allocation rate signal | accurate | accurate
> Cross-socket atomic traffic | 1 per buffer | 1 per 64
>
> There is one subtle difference worth naming. When a backend finds a
> victim at B[5] of its batch, it returns with MyBatchEnd still sitting
> at B[63]. The next time that backend needs a victim it resumes at
> B[6], not at wherever the global hand now points. So the backend
> drains its batch over multiple StrategyGetBuffer() calls rather than
> all at once. Under heavy load, where batches are consumed in
> microseconds, this is invisible. Under light load, the implication is
> that some buffers can sit with slightly stale usage_count for longer
> than they would have before. But "light load" means "the sweep is
> barely moving and nothing wants to evict anyway" -- so the effect
> doesn't show up where it would hurt.
>
> There's also a small positive side-effect: cache locality. The backend
> that just touched BufferDescriptor[B[0]] has the adjacent descriptors
> warm in L1/L2. Walking B[0..63] locally is cheaper than walking a
> striped interleaving where each descriptor was last touched by a
> different core. I haven't tried to isolate this in perf, but it falls
> out naturally.
>
>
> Benchmarks
> ----------
>
> Jim ran these; I'm still working on reproducing them locally and will
> post independent numbers in a follow-up. All bare metal, Linux, huge
> pages enabled throughout (more on that below), postmaster pinned to
> node 0 with `numactl --cpunodebind=0` because otherwise stock TPS
> varied from 31K to 40K depending on which node the postmaster happened
> to land on at launch -- worth flagging for anyone trying to reproduce.
>
> Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
> working set always spills and the sweep is hot.
>
> r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):
>
> pgbench RO:
> Clients Stock Patched Delta
> 64 31,457 36,353 +16%
> 128 31,678 37,864 +20%
> 256 31,510 37,558 +19%
> 384 31,431 37,464 +19%
> 512 31,329 37,040 +18%
>
> pgbench RW:
> Clients Stock Patched Delta
> 64 7,685 7,713 0%
> 128 10,420 10,541 +1%
> 256 12,393 12,463 +1%
> 384 15,317 15,197 -1%
> 512 17,930 17,978 0%
>
> m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
> RO +19-20%, RW within noise.
>
> c8i.metal-48xl (192 vCPUs, 1 socket):
> Single-socket -> batch_size=1 -> original code path. No
> behavioral change. (I double-checked this one specifically
> because it's the sanity test for the gate.)
>
> HammerDB TPC-C on m6i.metal (1000 warehouses):
> VUs Stock Patched Delta
> 128 358,518 349,787 -2%
> 256 332,098 330,272 -1%
> 384 365,782 377,519 +3%
> 512 370,663 386,526 +4%
>
> No TPC-C regression, which was the thing we were most worried about. An
> earlier attempt (per-socket partitioned sweep, see below) was -13% on
> this same workload.
>
> The general shape is: the scaling curve flattens later. Unpatched, TPS
> tops out around 128 clients and stays flat up to 512 because backends
> are spending cycles waiting on the cache line rather than
> doing work. Patched, the curve keeps rising past the point where
> unpatched plateaus.
>
> Huge pages caveat: all of the above was run with huge pages on, on
> large-memory instances (the r8i.96xl has 3TB, so Jim never considered
> running without them). We have not characterized the non-huge-pages
> case. That's on my list; I don't expect it to change the conclusion,
> but I shouldn't speak for data I haven't collected.
>
>
> Relationship to Tomas's NUMA series
> -----------------------------------
>
> Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer
> interleaving across nodes, partitioned freelists, partitioned clock
> sweep, PGPROC interleaving, and related pieces. I want to be careful
> here because I don't think we should frame this patch as competing with
> that work.
>
> One thing I found striking as I re-read the thread: in the benchmarks
> Tomas posted later in the series, *most of the benefit comes from
> partitioning the clock sweep*, and the NUMA memory-placement layer on
> top sometimes runs slower than partitioning alone. His own conclusion,
> quoted roughly: the benefit mostly comes from just partitioning the
> clock sweep, and it's largely independent of the NUMA stuff; the NUMA
> partitioning is often slower.
>
> That observation is the thing that makes me think batching is worth
> considering on its own. It's going after the same bottleneck Tomas's
> partitioning addresses, but:
>
> - without splitting global eviction visibility (which is where
> cross-partition stealing gets complicated),
> - without requiring NUMA-aware buffer placement (which has huge
> page alignment, descriptor-partition-mid-page, and resize
> complications that are still being worked out in that thread),
> - without touching PGPROC or bgwriter.
>
> What this patch does *not* do:
> - place buffers on specific NUMA nodes
> - partition the freelist
> - touch PGPROC
> - add new GUCs
> - change bgwriter
>
> What this patch *does* do:
> - target exactly the clock-sweep contention that Tomas's
> partitioning targets, and reduce it by ~64x, in ~30 lines.
>
> If Tomas's series lands in full, this patch becomes redundant for its
> primary use case (though even within a partitioned sweep, the
> per-partition atomic still benefits from batching, so it's arguably a
> useful primitive either way). If Tomas's series lands incrementally
> over several cycles -- which the open items in that thread suggest is
> the realistic path -- this gets us a real chunk of the multi-socket win
> now.
>
> This patch is also orthogonal to my earlier thread about removing the
> freelist entirely [2], but given the proximity to that code Jim agreed
> that I could propose/steward it here on the list for consideration.
>
>
> Open questions / things I'd like feedback on
> --------------------------------------------
>
> - Batch size. 64 is a round number that worked well in testing, but
> Nathan raised the reasonable point that on small shared_buffers
> with high concurrency, a fixed 64 could be unfortunate. Options:
> scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
> with max_connections, keep it fixed but let operators tune it, or
> make it a function of NUMA node count. I don't have a strong
> opinion yet; the Min(batch, NBuffers) cap covers the "obviously
> wrong" corner but doesn't speak to the "several hundred backends
> on a few-MB shared_buffers" shape. Numbers/ideas/proposals welcome.
>
> - NUMA detection. The gate uses pg_numa_init() /
> pg_numa_get_max_node(). On systems where libnuma isn't available,
> or where get_mempolicy is blocked (some container configurations),
> we fall back to batch size 1. That's safe but it misses the
> "single socket, many cores, still benefits from fewer atomics"
> case. Might be worth a way to force-enable, or batching on all
> systems with a smaller batch size when single-socket. I'd like to
> measure before deciding.
>
> - Eviction pattern on reads. Nathan also flagged that with batching,
> the buffers a backend ends up pinning in one StrategyGetBuffer()
> call will tend to be contiguous in buffer-id space rather than
> scattered, which is a different allocation pattern than today.
> The usage_count analysis above says this is benign, but if anyone
> has an intuition for a workload where this would be observable
> (e.g., something that cares about the mapping between buffer-id
> and relation locality), I'd like to hear it.
>
> - nextVictimBuffer wraparound. The current code has a mild overflow
> concern papered over with "highly unlikely and wouldn't be
> particularly harmful". With batching this is no worse than before,
> but if we're already touching this function, it might be worth
> thinking about whether to tighten it up in the same patch or a
> follow-up.
>
> - Should the non-NUMA value for this be derived from core counts that
> imply L1/L2 cache layouts or simply default to 8 rather than 1 to
> realize some benefit?
So, I'm answering my own question here. Yes, it should. Ideas below.
> - Should there be a postgresql.conf setting for this that takes
> precedence?
>
>
> I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at
> the small-shared_buffers corner, and post perf stat output showing the
> atomic/cache-miss deltas over the next few days. In the meantime,
> eyeballs and skepticism welcome -- I would especially welcome comments
> from Andres, who's been in this code recently, and from Tomas, whose
> series has the most overlap.
>
> I realize that we're past feature freeze and working on release notes
> for v19, so the chances of merging this are slim to none. I think this
> could be considered a "performance bug fix for NUMA systems" in this
> release, but that is stretching it a bit. It is a big ask at this
> stage to land a change like this.
>
> best.
>
> -greg
>
> [1]
> https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017@vondra.me
> [2]
> https://www.postgresql.org/message-id/f0e3c02e-e217-4f04-8dab-1e7e80a228c0@burd.me
> Attachments:
> * v1-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch
ComputeClockBatchSize() has two phases: select a base batch from hardware topology, then cap it to prevent over-claiming.
Phase 1: Base batch from topology

    int ncpus = pg_get_online_cpus();
    int numa_nodes = (pg_numa_init() != -1) ? pg_numa_get_max_node() + 1 : 1;

    if (numa_nodes > 1)
        base_batch = 64;
    else if (ncpus > 16)
        base_batch = 32;
    else if (ncpus > 8)
        base_batch = 16;
    else if (ncpus > 4)
        base_batch = 8;
    else
        base_batch = 1;
The reasoning at each tier:
- NUMA (multi-socket): Atomic ops cross the interconnect (QPI/UPI/Infinity
Fabric). Round-trip latency is ~100-300ns vs ~10-40ns intra-socket.
Batch=64 amortizes that heavily.
- >16 cores, single socket: Still significant L3 contention, many cores
competing for the same cache line. Batch=32 cuts atomic ops by 32x.
- 9-16 cores: Moderate contention. Batch=16.
- 5-8 cores: Light contention. Batch=8.
- <=4 cores: Almost no contention. Batch=1 (no batching). The overhead of
batching logic isn't worth it, and there's a fairness tradeoff - batching
means one backend "owns" a range of buffers temporarily, which matters
more when there are few buffers per backend.
Phase 2: Cap to prevent over-claiming

    max_batch = (MaxBackends > 0)
        ? pool_nbuffers / (2 * MaxBackends)
        : pool_nbuffers / 200;
    if (max_batch < 1)
        max_batch = 1;

    return Min(base_batch, Min(max_batch, pool_nbuffers));
The cap ensures that if every backend simultaneously claims a batch, the total claimed doesn't exceed half the pool:
batch_size * MaxBackends <= pool_nbuffers / 2
Why half? If all backends claimed the entire pool simultaneously, they'd each be sweeping overlapping ranges, thus wasting work and defeating the purpose. Keeping total claims under 50% of the pool means at any instant, at most half the buffers are "in flight" being evaluated by backends, and the other half are available for normal operation.
For a small dynamic pool (say 4096 buffers with MaxBackends=200), the cap computes to 4096 / 400 = 10, which overrides any larger base_batch. For the default pool with shared_buffers = 8GB (1M buffers) and MaxBackends=200, the cap is 1000000 / 400 = 2500, which is well above the max base_batch of 64, so the base_batch wins.

The final Min() against pool_nbuffers handles the degenerate case of a pool smaller than the batch size.
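Putting the two phases together, the whole function comes out roughly like this (assembled from the fragments above; pg_get_online_cpus() and pool_nbuffers are the names used in this write-up, so treat the exact spelling as illustrative rather than as the attached patch):

    static uint32
    ComputeClockBatchSize(uint32 pool_nbuffers)
    {
        int     ncpus = pg_get_online_cpus();
        int     numa_nodes = (pg_numa_init() != -1) ?
                             pg_numa_get_max_node() + 1 : 1;
        uint32  base_batch;
        uint32  max_batch;

        /* Phase 1: base batch from hardware topology */
        if (numa_nodes > 1)
            base_batch = 64;    /* cross-socket atomics dominate */
        else if (ncpus > 16)
            base_batch = 32;    /* large single socket: L3 contention */
        else if (ncpus > 8)
            base_batch = 16;
        else if (ncpus > 4)
            base_batch = 8;
        else
            base_batch = 1;     /* small systems: original behavior */

        /* Phase 2: cap so simultaneous claims cover at most half the pool */
        max_batch = (MaxBackends > 0)
            ? pool_nbuffers / (2 * MaxBackends)
            : pool_nbuffers / 200;
        if (max_batch < 1)
            max_batch = 1;

        return Min(base_batch, Min(max_batch, pool_nbuffers));
    }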
The Tradeoff
Larger batches reduce atomic contention but increase sweep unevenness: one backend might sweep through "cold" buffers while another's batch happens to land on "hot" ones. The tiered approach balances this: batch aggressively only when the hardware topology makes contention the dominant cost (NUMA, many-core), and stay conservative on small systems where fairness matters more.
I think this is better because:
1. The original patch only batched on multi-socket NUMA systems. The new algorithm also provides atomic contention benefits on large single-socket systems (>16 cores) where L3 cache contention matters.
2. Conservative on small systems: Systems with ≤4 cores get batch_size=1 (original behavior) since batching overhead outweighs contention benefits and fairness matters more.
3. Prevents pathological over-claiming: The cap mechanism prevents scenarios where many backends claim huge batches relative to a small buffer pool.
Based on the algorithm, here's what different systems would get:
System             CPUs  NUMA      Total RAM  Shr Buf   Batch Size  Atomic Reduction
=================  ====  ========  =========  ========  ==========  ================
r8i.metal-96xl      384  multi     3072GB     2457.6GB          64  64x
m6i.metal           128  multi     512GB      409.6GB           64  64x
c8i.metal-48xl      192  1 socket  192GB      153.6GB           32  32x
Large server         64  multi     256GB      204.8GB           64  64x
Medium server        32  1 socket  64GB       51.2GB            32  32x
Small server         16  1 socket  32GB       25.6GB            16  16x
Developer machine     8  1 socket  16GB       12.8GB             8  8x
Small VM              4  1 socket  4GB        3.2GB              1  no change
Overloaded VM         8  1 socket  4GB        3.2GB              8  8x
best.
-greg
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch | application/octet-stream | 8.2 KB |
| v2-0002-Improve-clock-sweep-batch-sizing-with-CPU-aware-a.patch | application/octet-stream | 5.0 KB |