Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention

From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Burd <greg(at)burd(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>, Nathan Bossart <nathandbossart(at)gmail(dot)com>
Subject: Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention
Date: 2026-04-27 14:15:47
Message-ID: nwri4oxfusruo7yog2rewbeiegs4rhclkjukn54mcumiilak4b@6chlhxypm5sq
Lists: pgsql-hackers

Hi,

Thanks for looking into this.

On 2026-04-25 16:08:02 -0400, Greg Burd wrote:
> Does batching mess up the meaning of usage_count?
> --------------------------------------------------
>
> Short answer: no. I want to walk through this because it was my first
> concern too, and I think it's the question that will come up most on review.

>
> The clock sweep's usage_count is an access-frequency approximation measured
> in units of *complete passes*. A buffer with usage_count = N survives N
> passes without a re-pin. The semantic meaning lives at pass granularity,
> not at individual-buffer granularity.

>
> What batching changes: intra-pass temporal ordering. Without batching, with
> N backends sweeping, decrements are interleaved -- backend A hits B[0],
> backend B hits B[1], backend C hits B[2]. With batching, backend A hits
> B[0..63] in a tight local burst, then backend B hits B[64..127], etc. The
> 64-buffer chunks are decremented in bursts rather than individually.

>
> Why it doesn't matter:
>
> 1. Every buffer still gets decremented exactly once per complete
> pass. The invariant the algorithm actually depends on is
> untouched.
>
> 2. A buffer's survival window is the time between consecutive
> passes. That's milliseconds to seconds under load. Whether
> B[0] gets decremented 50us before or 50us after B[63] within
> the same pass is below the resolution of anything usage_count
> is trying to measure.

I don't think this is true, unfortunately. Sure, it is if you have a
completely uniform, IO-intensive workload, but that's not all there is. If
you have a bunch of connections that replace buffers at a low rate and a
bunch of connections that do so at a high rate, the batches "checked out" by
the "low rate" connections won't be processed soon. Thus the buffers in
those batches won't have their usage_count decremented and thus have a
stronger "protection" against replacement.

You can argue that may be OK, because it'd be unlikely that the next sweep
would assign the same buffers to a "low rate" backend again. But that'd be an
argument you'd have to make and validate.

I'm somewhat doubtful that batching that's independent of contention and
independent of the usage rate will work out all that well. If you instead
went for a partitioned sweep architecture, with balancing between the
different partitions, you don't have that issue. And you have a building block
for more numa awareness etc.

> There is one subtle difference worth naming. When a backend finds a victim
> at B[5] of its batch, it returns with MyBatchEnd still sitting at B[63].
> The next time that backend needs a victim it resumes at B[6], not at
> wherever the global hand now points. So the backend drains its batch over
> multiple StrategyGetBuffer() calls rather than all at once. Under heavy
> load, where batches are consumed in microseconds, this is invisible. Under
> light load, the implication is that some buffers can sit with slightly
> stale usage_count for longer than they would have before. But "light load"
> means "the sweep is barely moving and nothing wants to evict anyway" -- so
> the effect doesn't show up where it would hurt.

As mentioned above, this assumes that the replacement rate is uniform between
backends, which I think is not uniformly true outside of benchmarks.

> There's also a small positive side-effect: cache locality. The backend that
> just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
> L1/L2.

A BufferDesc is 64 bytes. With common cacheline sizes and stuff like
adjacent-cacheline prefetching you'll have *maybe* 2 consecutive BufferDescs
in L1/L2. Where it might help more is the TLB.

> Benchmarks
> ----------
>
> Jim ran these; I'm still working on reproducing them locally and will post
> independent numbers in a follow-up. All bare metal, Linux, huge pages
> enabled throughout (more on that below), postmaster pinned to node 0 with
> `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
> depending on which node the postmaster happened to land on at launch --
> worth flagging for anyone trying to reproduce.

That's an odd one that I think you need to investigate separately.

> Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
> working set always spills and the sweep is hot.

Uhm, is this something worth optimizing substantially for? What you're
measuring here is basically the worst possible way of configuring a database,
with full double buffering and a lot of memory bandwidth dedicated to copying
buffers from one place to another. That's maybe a sane setup if you have a lot
of small databases that you can't configure individually, but that's not the
case when you run a reasonably large workload on a 384vCPU setup.

I think to be really convincing you'd have to do this with actual IO involved
somewhere.

> Relationship to Tomas's NUMA series
> -----------------------------------
>
> Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer
> interleaving across nodes, partitioned freelists, partitioned clock sweep,
> PGPROC interleaving, and related pieces. I want to be careful here because
> I don't think we should frame this patch as competing with that work.
>
> One thing I found striking as I re-read the thread: in the benchmarks Tomas
> posted later in the series, *most of the benefit comes from partitioning the
> clock sweep*, and the NUMA memory-placement layer on top sometimes runs
> slower than partitioning alone. His own conclusion, quoted roughly: the
> benefit mostly comes from just partitioning the clock sweep, and it's
> largely independent of the NUMA stuff; the NUMA partitioning is often
> slower.

That was partially because he measured on something that didn't really have
significant NUMA effects though...

> That observation is the thing that makes me think batching is worth
> considering on its own. It's going after the same bottleneck Tomas's
> partitioning addresses, but:
>
> - without splitting global eviction visibility (which is where
> cross-partition stealing gets complicated),

You *are* doing that though.

> - without requiring NUMA-aware buffer placement (which has huge
> page alignment, descriptor-partition-mid-page, and resize
> complications that are still being worked out in that thread),

You can do the partitioned clock sweep without *any* of that.

Greetings,

Andres Freund
