Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From: Greg Burd <greg(at)burd(dot)me>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date: 2025-08-12 21:42:35
Message-ID: B34F7A3F-A407-47B8-B19D-0B62BE8B4AA2@getmailspring.com
Lists: pgsql-hackers


On Aug 11 2025, at 9:09 am, Tomas Vondra <tomas(at)vondra(dot)me> wrote:

> Hi,
>
> I took a look at this thread / patches, mostly because it's somewhat
> related to the NUMA patch series [1]. The freelist is one of the things
> the NUMA patches tweak, to split it into per-node partitions. So
> removing the freelist would affect the NUMA patches, mostly simplifying
> them by making the freelist part irrelevant.

Thank you, Tomas, for taking the time to review and performance-test
this patch set. I greatly appreciate your feedback.

> On the whole the idea seems reasonable - I'm not against removing the
> freelist, assuming Andres is right and the freelist is not very useful
> anyway. So I decided to validate this assumption by running a couple of
> benchmarks ...

Amazing, thank you. I'll try to replicate your tests tomorrow to see
whether my optimized division and modulo functions do in fact help. I
realize that both you and Andres are (rightly) concerned that the
performance impact of IDIV on some CPUs can be excessive.

I'm going to post this new set of patches and then start a separate
email to analyze and respond to the performance tests.

> benchmarks
> ----------
>
> I did two tests:
>
>
> 1) run-pgbench.sh
>
> - read-only pgbench
>
> - scale set to different fraction of shared buffers (50%, 90%, 110%)
>
> - tracks tps
>
>
> 2) run-seqscan.sh
>
> - concurrent evict+seqscans on different tables
>
> - shared buffers set to different fractions of total dataset sizes (50%,
> 90%, 110%)
>
> - This is similar to the workload Andres suggested in [2], except that
> it uses a simple SELECT instead of pg_prewarm. The patches change how
> pg_prewarm decides to stop, and I didn't want to be affected by that.
>
> - tracks tps, and average latency for the evict + select
>
>
> The attached test scripts are "complete" - they set up an instance,
> disable checksums, set a couple GUCs, etc.
>
> I wanted to see the effect of each patch, so tests were done on master,
> and then with patches applied one by one (after fixing the compile
> failures in 0001).
>
> I did this on two EPYC machines with 96 cores (lscpu attached), and the
> results are virtually the same. Attached are PDFs with results from one
> of the machines.
>
> On the whole, the impact of the patches is negligible: it's within 1% in
> either direction, and I don't see any particular trend / systemic change
> in the behavior (e.g. regressions for some cases). Seems like noise,
> which supports the assumption that the freelist is not very useful.
>
> The one positive exception is "evict latency", which tracks the latency
> of pg_buffercache_evict_relation(). That got consistently a little bit
> better, which makes sense as it no longer needs to maintain the freelist.
>
>
> freelist statistics?
> --------------------

I like this idea too; I'll try to work it into the next set of patches.

> There's one caveat, though. I find it tricky to "know" if the workload
> actually uses a freelist. Is there a good way to determine if a buffer
> was acquired from a freelist, or through the clocksweep?
>
> I'm worried the tests might actually have empty freelists, or maybe the
> freelists won't be used very much. In that case removing the freelist
> understandably has no impact. But it says nothing about cases that get
> buffers from freelists often ...
>
> I can probably think about each workload and deduce if there will be
> freelists. For the benchmarks described earlier, I think the situation
> is this:
>
> pgbench, shared buffers < 100% -> no freelists, uses clock-sweep
> pgbench, shared buffers > 100% -> freelists, but unused (no evictions)
>
> seqscan, shared buffers < 100% -> freelists, but clocksweep too
> seqscan, shared buffers > 100% -> freelists, no clocksweep
>
> But maybe I got it wrong for some cases? It'd be good to have a way to
> collect some stats for each test, to confirm this and also to quantify
> the effects. A test that gets 10% of buffers from a freelist may behave
> differently from a test that gets 99% of buffers from a freelist.
>
> I propose we add a system view showing interesting information about
> freelists and buffer allocation, or perhaps extend an existing one (e.g.
> pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches
> I simply added a new view into pg_buffercache.
>
> I'm not suggesting this would get committed, at this point I'm more
> focused on the development. Whether the view would be useful outside the
> development is an open question. Also, if it adds stats about freelists
> (for current master), that'd be useless once we remove them.
>
>
> Now, a couple review comments about the individual patches:

Apologies for not testing each commit separately.

> 0001
> ----
>
> 1) nitpick: adds a blank line at the end of buffer/README

Removed.

> 2) gcc complains about a missing have_free_buffer prototype, but it's
> no longer needed and can be removed

Removed.

> 3) freelist fails to compile, because of:
>
> freelist.c: In function ‘have_free_buffer’:
> freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’
> from incompatible pointer type [-Wincompatible-pointer-types]
>   166 |   uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
>       |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>       |                                    |
>       |                                    pg_atomic_uint32 *
>
> That is, it should be accessed using pg_atomic_read_u32.

Removed.

> CI/cfbot does not report this, because it applies all the patches, and
> 0002 fixes this.

That makes sense, I'll be more careful with each patch in the future.

> 0002
> ----
>
> 1) I think the buffer/README needs a bit more work. It now suddenly
> mentions completePasses, without explaining what it's for, and how it
> works with nextVictimBuffer. And then it's not mentioned again, so why
> even mention it? We don't mention various other fields ...

I've taken a stab at expanding on the new approach, happy to continue to
refine the wording.

> It's true completePasses is not a new thing, and probably should have
> been already explained by the README. Still, it seems a bit confusing.

I explain the idea of "complete passes" and how it is now encoded in the
clockSweepCounter (spoiler: that's the new name for nextVictimBuffer).
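
To make the encoding concrete, here's a minimal sketch of the idea (not
the patch verbatim, just what the arithmetic boils down to):

    /* One atomic 64-bit ticker encodes both values. */
    uint64  counter = pg_atomic_fetch_add_u64(&StrategyControl->clockSweepCounter, 1);
    uint32  hand = (uint32) (counter % NBuffers);  /* next buffer to inspect */
    uint64  passes = counter / NBuffers;           /* complete passes so far */

Because both values derive from a single atomic read-modify-write, they
are always mutually consistent without holding buffer_strategy_lock.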

> 2) It'd be good if the README explained how we coordinate access to
> multiple atomic values (with the spinlock removed), why the current
> approach is safe, or how it's made safe. Or at least mention that we
> thought about it, and where to look for details (e.g. in a comment for
> a particular function).

I think I've covered this base as well, let me know if it is still
lacking important content.

> 3) I have no opinion on how "clock-sweep" should be spelled, but if we
> want to unify the spelling, it should be done in a separate preparatory
> patch, so that it does not mix with the actual changes. It adds quite a
> few changes.

Absolutely, I should have made that the first patch in the set. Now it is.

> 3) I'm not sure about: Assert(prev_strategy_passes <= strategy_passes);
>
> Either it's unlikely but possible, and then it should be elog(ERROR), or
> we consider it impossible (and then an Assert is fine). Maybe just
> rephrase the comment to say we consider the overflow impossible.

I've rephrased it to state that overflow is impossible (even at a
billion clock ticks per second, a 64-bit counter takes more than 500
years to wrap) but memorialized it in an Assert for good measure.

> Also, no need to split the passes_delta assignment, I think. It's fine
> to subtract before an assert - it'll be bogus/negative, but so what?
> It's not an undefined / invalid memory access or anything like that.

Okay, works for me.

> 4) nitpick: freelist.c has incorrect formatting of "{" in a couple
> places, e.g. it should be on a newline in a struct definition, etc.
> pgindent would eventually fix this, but it bugs me.

Yep, guilty as charged. I forgot that last pgindent run before git
format-patch. It bugs me too, fixed.

> 5) I don't like the CLOCK_HAND_POSITION() naming very much, partially
> because it's not really clear what "clock hand" is. Intuitively I know
> what it's supposed to mean, also because I worked with this code before.
> But not everyone has that benefit, and may assume something else.

Now that the nextVictimBuffer is renamed to clockSweepCounter the new
macro names in 0003 are CLOCKSWEEP_HAND() and CLOCKSWEEP_PASSES(). In
0004 those turn into functions ClockSweepHand() and ClockSweepPasses().

> IMO if nextVictimBuffer is no longer meant to be a "buffer", but a
> counter that only ever increases, we should call it something different.
> Say, "clockSweepCounter"?

Great idea and a good suggested name, renamed to clockSweepCounter.

> Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER()? And why
> not have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too?

Agreed, done.

> 6) I don't understand why the comment mentions the BufferDescriptor
> array at all? Sure, we may look at the descriptor, but why is this
> detail relevant in this place?

It's not, removed.

> Also, shouldn't it say "divide by NBuffers" at the end?

Also cleaned up.

> 7) CLOCK_HAND_POSITION() should probably have a comment explaining why
> it does the masking, etc. And why it's better than simply doing the
> modulo on the counter directly. I mean, why is this better than just
> doing (counter % NBuffers)? It's going to work on uint64 anyway.

Took a swing at comments, not thrilled but better than nothing.

> 8) There was a discussion about doing the modulo in a better way, but I
> don't see any message explaining it clearly enough for me. And then
> there were some issues with it, and I guess the idea was abandoned.

Well, I made an attempt at implementing the algorithm in the
Granlund-Montgomery paper "Division by Invariant Integers using
Multiplication" [1] and got it to work eventually (attached as
v13-0004b.txt to avoid being recognized as part of this patch set;
rename it to
v13-0004-Optimize-modulo-and-division-used-in-clock-sweep.patch). In my
approach I created logic with two paths for division and modulo: one for
when NBuffers is a power of two and one for when it isn't. I didn't mind
having a branch in the inlined functions because NBuffers never changes
at runtime, so the branch predictor should get it right essentially 100%
of the time.
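
A minimal sketch of that two-path shape (illustrative only; in the
attached patch the non-power-of-two path uses the Granlund-Montgomery
multiply rather than the plain % shown here):

    static inline uint64
    clock_hand(uint64 counter)
    {
        uint64  nbuffers = (uint64) NBuffers;

        if ((nbuffers & (nbuffers - 1)) == 0)
            return counter & (nbuffers - 1);  /* power of two: just mask */
        return counter % nbuffers;            /* stand-in for the G-M path */
    }

Since NBuffers is fixed for the life of the server, the branch always
resolves the same way and should be essentially free after warmup.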

Then I found an implementation of the paper called "fastdiv" [2] that
doesn't branch but uses a few more instructions; it's released under the
MIT License. I'm not sure what the convention is in PostgreSQL when
including code inspired by another project (now is a good time for me to
search the wiki and find out). Both my implementation and the fastdiv
implementation work and should be considerably faster on modern CPUs.
I've attached the fastdiv version, but I can revert to mine if that's
easier or more in line with community standards. I'd estimate the
fastdiv version runs in ~12-18 cycles (vs. 26-90 for a hardware divide
or modulo) on most CPU architectures.
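
For anyone skimming, the core branchless trick looks roughly like this.
This is a sketch of the reciprocal-multiplication idea in the style of
Lemire's "fastmod", not the fastdiv code itself; it assumes GCC/Clang's
unsigned __int128 and a 32-bit operand, whereas the patch works on the
full 64-bit counter:

    #include <stdint.h>

    static uint64_t mod_magic;    /* ceil(2^64 / NBuffers), precomputed */
    static uint32_t mod_divisor;

    static void
    fastmod_init(uint32_t d)
    {
        mod_divisor = d;
        mod_magic = UINT64_MAX / d + 1;
    }

    /* n % d with two multiplies and a shift -- no divide instruction */
    static inline uint32_t
    fastmod(uint32_t n)
    {
        uint64_t  lowbits = mod_magic * n;  /* (n * magic) mod 2^64 */

        return (uint32_t) (((unsigned __int128) lowbits * mod_divisor) >> 64);
    }

fastmod_init() would run once at startup (for us, somewhere like
StrategyInitialize()), since NBuffers never changes while the server is
running.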

> I'm asking because my NUMA patches do a modulo in a couple places in the
> clock-sweep part too, so if there's a better way, I'd like to know.

Tell me your thoughts on that algorithm; I think it does the job.

> 9) Why initialize "hand" to UINT64_MAX? Seems unnecessary.

Yep, removed.

> 10) nitpick: StrategySyncStart formatting is a bit wrong

Yes, fixed.

>
> 0003
> ----
>
> 1) I don't quite understand the purpose of this part. How is this
> abstracting anything, compared to what we already have (or what the two
> earlier parts did)?
>
> And if it "abstracts the algorithm", what would be the purpose? Do we
> expect some alternative algorithms? And would 0003 really make it easier
> compared to just changing the code directly? I don't get it.
>
>
> 2) I don't understand why we need to explicitly pass the "clock" to
> every ClockSweepCycles/ClockSweepPosition call.
>
> I mean, there's only ever one instance anyway - at least without "my"
> NUMA patches. But even with the NUMA patches, the partition is
> determined inside those functions; otherwise every caller would have to
> repeat that. I think this is unnecessary/inconvenient.

Okay, I understand that. At one point I had implementations for
ClockPro and SIEVE, but you're right.

> regards
>
>
> [1]
> https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017%40vondra.me
>
> [2]
> https://www.postgresql.org/message-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3%40euyewe6tf3z3
>
>
> --
> Tomas Vondra

Thanks again for digging into the code. I took on this project to help
your NUMA work, but once I got started it seemed like a good idea even
if that doesn't land: removing the freelist should reduce contention
across cores and hopefully be generally faster.

-greg

[1] https://gmplib.org/~tege/divcnst-pldi94.pdf
[2] https://github.com/jmtilli/fastdiv

Attachment Content-Type Size
v13-0001-Use-consistent-naming-of-the-clock-sweep-algorit.patch application/octet-stream 6.7 KB
v13-0003-Remove-the-need-for-a-buffer_strategy_lock.patch application/octet-stream 15.8 KB
v13-0004-Optimize-modulo-and-division-used-in-clock-sweep.patch application/octet-stream 5.9 KB
v13-0004b.txt text/plain 6.4 KB
v13-0002-Eliminate-the-freelist-from-the-buffer-manager-a.patch application/octet-stream 17.7 KB
