Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From: Greg Burd <greg(at)burd(dot)me>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date: 2025-08-27 19:42:48
Message-ID: 70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
Lists: pgsql-hackers


On Aug 17 2025, at 12:57 am, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Sun, Aug 17, 2025 at 4:34 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> Or if you don't like those odds, maybe it'd be OK to keep % but use it
>> rarely and without the CAS that can fail.
>
> ... or if we wanted to try harder to avoid %, could we relegate it to
> the unlikely CLOCK-went-all-the-way-around-again-due-to-unlucky-scheduling
> case, but use subtraction for the expected periodic overshoot?
>
> if (hand >= NBuffers)
> {
>     hand = hand < NBuffers * 2 ? hand - NBuffers : hand % NBuffers;
>     /* Base value advanced by the backend that overshoots by one tick. */
>     if (hand == 0)
>         pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
> }
>

Hi Thomas,

Thanks for all the ideas. I tried out a few of them along with a number
of others. After a lot of measurement and a few off-channel discussions,
I think the best way forward is to focus on removing the freelist and
not change the lock or the clock-sweep algorithm much for now. So, the
attached patch set keeps the first two patches from the last set and
drops the rest.

But wait, there's more...

As a *bonus* I've added a new third patch with some proposed changes to
spark discussion. While researching experiences in the field at scale, I
ran into a few other buffer management issues. The one I try to address
in this new patch 0003 involves very large shared_buffers (NBuffers)
combined with very large active datasets, which leave most buffer usage
counts at or near the maximum (5). In that state the clock-sweep
algorithm may need up to NBuffers * 5 "ticks" before it identifies a
buffer to evict. It also pollutes the completePasses value used to tell
the bgwriter where to start working.
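
For context, here is roughly what the stock victim search does today (a
simplified sketch paraphrased from StrategyGetBuffer() in freelist.c;
the all-buffers-pinned check and ring-strategy handling are omitted).
Each tick sheds exactly one usage count, so a pool saturated at 5 needs
about five full laps of the hand:

static BufferDesc *
ClockSweepFindVictim(void)
{
    for (;;)
    {
        BufferDesc *buf = GetBufferDescriptor(ClockSweepTick());
        uint32      buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(buf_state) > 0)
            {
                /* not evictable yet; decrement by exactly one */
                buf_state -= BUF_USAGECOUNT_ONE;
                UnlockBufHdr(buf, buf_state);
            }
            else
                return buf;     /* victim found, header still locked */
        }
        else
            UnlockBufHdr(buf, buf_state);   /* pinned; skip it */
    }
}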

So, in this patch I add per-backend buffer usage tracking and proactive
pressure management. Each tick of the hand can now decrement a buffer's
usage count by a calculated amount, not just 1, based on a /hand-wavy
first attempt at magic/.
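
To make that concrete, here is a hypothetical shape of the heuristic;
the function name and the pressure input are my invention for
illustration, not what 0003 literally does:

/*
 * Hypothetical sketch only: scale the decrement with "pressure", e.g.
 * the fraction of recently swept buffers found at the maximum usage
 * count, so a saturated pool converges in fewer laps of the hand.
 */
static inline uint32
SweepDecrement(uint32 usage_count, double pressure)
{
    uint32 step = 1 + (uint32) (pressure * (BM_MAX_USAGE_COUNT - 1));

    return Min(step, usage_count);  /* never underflow the count */
}

The sweep loop above would then subtract SweepDecrement(...) *
BUF_USAGECOUNT_ONE instead of a fixed BUF_USAGECOUNT_ONE.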

The thing I'm sure this doesn't help with, and may in fact hurt, is
keeping frequently accessed buffers in the buffer pool. I imagine a
two-tier approach here, where some small subset of buffers that are
reused frequently enough is never even considered by the clock-sweep
algorithm.
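
Purely as illustration of that idea (no such flag exists in these
patches or in buf_internals.h), the sweep loop sketched above could
skip a protected hot tier via an invented state bit at the top of the
loop body:

        /* Hypothetical: BM_SWEEP_EXEMPT marks a small, frequently
         * reused subset of buffers the sweep never considers. */
        if (buf_state & BM_SWEEP_EXEMPT)
        {
            UnlockBufHdr(buf, buf_state);
            continue;   /* protected tier: never a victim candidate */
        }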

Regardless, I feel the first two patches in this set address the
intention of this thread. I added patch 0003 just to start a
conversation; please chime in if any of this interests you. Maybe this
new patch should take on a life of its own in a new thread? If anyone
thinks the approach has merit, I'll do that.

I look forward to thoughts on these ideas, and hopefully to finding
someone willing to help me get the first two patches over the line.

best.

-greg

Attachment Content-Type Size
v14-0001-Use-consistent-naming-of-the-clock-sweep-algorit.patch application/octet-stream 6.7 KB
v14-0002-Eliminate-the-freelist-from-the-buffer-manager-a.patch application/octet-stream 17.7 KB
v14-0003-Track-buffer-usage-per-backend-and-use-that-to-i.patch application/octet-stream 21.2 KB
