Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From: Greg Burd <greg(at)burd(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date: 2025-07-25 19:02:39
Message-ID: 1c84de03-7858-4bcb-b104-336355470a43@burd.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/22/25 14:43, Greg Burd wrote:
> On 7/21/25 14:35, Andres Freund wrote:
>> Hi,
>>
>> On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
>>> On 7/18/25 13:03, Andres Freund wrote:
>>> Hello.  Thanks again for taking the time to review the email and patch,
>>> I think we're onto something good here.
>>>
>>>> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
>>>> for the have_free_buffer() use in autoprewarm, it's a rather trivial
>>>> patch. And I really couldn't measure regressions above the noise level, even
>>>> if absurdly extreme use cases.
>>> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
>>> for keeping the freelist"?
>> Err, yes :(
> Phew.  :)  No worries.
>
>>>> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>>>>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>>>>> I think we'll likely need something to replace it.
>>>>> Fair, this (v5) patch doesn't yet try to address this.
>>>>>
>>>>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>>>>> right. The goal of the use have_free_buffer() is obviously to stop
>>>>>> prewarming
>>>>>> shared buffers if doing so would just evict buffers.  But it's not clear
>>>>>> to me
>>>>>> that we should just stop when there aren't any free buffers - what if the
>>>>>> previous buffer contents aren't the right ones?  It'd make more sense to
>>>>>> me to
>>>>>> stop autoprewarm once NBuffers have been prewarmed...
>>>>> I had the same high level reaction, that autoprewarm was leveraging
>>>>> something
>>>>> convenient but not necessarily required or even correct.  I'd considered
>>>>> using
>>>>> NBuffers as you describe due to similar intuitions, I'll dig into that idea
>>>>> for
>>>>> the next revision after I get to know autoprewarm a bit better.
>>>> Cool. I do think that'll be good enough.
>>> I re-added the have_free_buffer() function only now it returns false
>>> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
>>> has made its first complete pass.  With that I reverted my changes in
>>> the autoprewarm module.  The net should be the same behavior as before
>>> at startup when using that module.
>> I don't think we should have a have_free_buffer() that doesn't actually test
>> whether we have a free buffer, that seems too likely to cause
>> misunderstandings down the line. What if we instead just limit the amount of
>> buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
>> the number of to-be-loaded buffers, so that shouldn't be hard.
> I'm glad you said that, I wasn't thrilled with that either and I'm not
> sure why I didn't just correct for that in the last patch set.  I'm now
> capping num_elements to NBuffers at most.
>
>>>>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>>>>> I'll dig into the Windows issues next week as well.
>>>> FWIW, there are backtraces generated on windows. E.g.
>>>>
>>>> https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>>>>
>>>> 000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
>>>> 000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
>>>> char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>>>> char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>>>> int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
>>>> 000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
>>>> unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
>>>> 000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
>>>> unsigned int * complete_passes = 0x000000cd`827fdfc0,
>>>> unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
>>>> 000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
>>>> struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
>>>> 000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
>>>> void * startup_data = 0x00000000`00000000,
>>>> unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
>>>> 000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
>>>> int argc = 0n3,
>>>> char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
>>>> 000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
>>>> int argc = 0n3,
>>>> char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>>>>
>>>> I.e. your new assertion failed for some reason that i can't *immediately* see.
>>> I put that in as a precaution and as a way to communicate the intention
>>> of the other code above it.  I never imagined it would assert.  I've
>>> changed clock_read() to only assert when the modulo differs and left
>>> that assert in the calling ClockSweepTick() function because it was
>>> redundant and I'm curious to see if we see a similar assert when testing
>>> the modulo.
>> Do you understand why it triggered? Because I don't immediately. The fact that
>> it triggered only on windows, where the compiler is rather different, makes it
>> worth understanding imo.
> I dug into the ASM for both GCC 15.1 and MSVC 19.latest (thanks
> godbolt.org!) for x86_64 and there was a critical difference.  It starts
> with the fact that I'd used uint32 for my NBuffersPow2Mask rather than
> uint64.  That then translates to two different compiled outputs for
> clock_read() (was: clock_modulo()).
>
> gcc-15.1 -O2
>
> clock_read(unsigned long long): and edi, DWORD PTR NBuffersPow2Mask[rip]
> mov edx, DWORD PTR NBuffers[rip] mov rax, rdi sub rax, rdx cmp rdi, rdx
> cmovb rax, rdi ret
>
> msvc-19.latest /O2
>
> hand$ = 8 unsigned int clock_read(unsigned int64) PROC ; clock_read,
> COMDAT mov edx, ecx and rdx, QWORD PTR unsigned int64 NBuffersPow2Mask ;
> NBuffersPow2Mask mov ecx, DWORD PTR unsigned int NBuffers ; NBuffers mov
> eax, edx sub eax, ecx cmp rdx, rcx cmovb eax, edx ret 0 unsigned int
> clock_read(unsigned __int64) ENDP ; clock_read
>
>
> Here's what I think was happening, the MSVC compiler produced assembly
> for "hand & NBuffersPow2Mask" that uses "rdx QWORD" while GCC uses "edi
> DWORD".  The 32-bit AND operation (edi) automatically zeros the upper 32
> bits of rdi after performing the and with the uint64 value of hand while
> "rdx QWORD" does not potentially leaving some of the upper 32 bits set. 
> My guess is that on Windows when the value of the clock hand exceeded
> UINT32_MAX (as can happen in as little as 3 seconds in a tight loop but
> likely took longer in the test run) the bits not masked out would
> inflate the resulting value which would be > NBufers and also differ
> from the simple modulo calculation causing the failed assertion.
>
> Changing the value of NBuffersPow2Mask to uint64 and using similarly
> sized types in these functions more closely aligns the assembly code and
> should fix this.

And it did make the code more correct, it just didn't fix that
particular bug.  The bug was in the logic for my optimized modulo.  Big
thank you to my PostgreSQL committer mentor Noah Misch who asked a
simple question yesterday in our monthly call, "if this algorithm is
correct, why isn't it how CPUs and/or compilers implement modulo?"  Then
went on to suggest that I write an exhaustive test for it (maybe even
coerce an LLM to do that for me) and test it.  So I did, and it was wrong.

I felt I'd give it one more try.  It turns out there is an algorithm [1]
based on some work that went into GMP a while ago [2], so I implemented
it and gave a try (attached test.c).  Without optimization it's 10-42%
faster, but when compiled with GCC -O3 or -Ofast that advantage goes
away and it is slightly slower.  So, simplicity for the win and let the
compiler do the hard parts I guess.

>>>>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>>>>
>>>>> /*
>>>>> * Compute strategy_delta = how many buffers have been scanned by the
>>>>> - * clock sweep since last time. If first time through, assume none. Then
>>>>> - * see if we are still ahead of the clock sweep, and if so, how many
>>>>> + * clock-sweep since last time. If first time through, assume none. Then
>>>>> + * see if we are still ahead of the clock-sweep, and if so, how many
>>>>> * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>>>> * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>>>> * behavior when the passes counts wrap around.
>>>>> */
>>>>> if (saved_info_valid)
>>>>> {
>>>>> - int32 passes_delta = strategy_passes - prev_strategy_passes;
>>>>> + int32 passes_delta;
>>>>> +
>>>>> + if (unlikely(prev_strategy_passes > strategy_passes))
>>>>> + {
>>>>> + /* wrap-around case */
>>>>> + passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>>>>> + }
>>>>> + else
>>>>> + {
>>>>> + passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>>>>> + }
>>>>>
>>>>> strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>>>> strategy_delta += (long) passes_delta * NBuffers;
>>>> That seems somewhat independent of the rest of the change, or am I missing something?
>>> That change is there to cover the possibility of someone managing to
>>> overflow and wrap a uint64 which is *highly* unlikely.
>> That risk existed previously too - I'm not against shoring things up, I'd just
>> do it in a precursor commit, to make this easier to review.
>>
>>> If this degree of paranoia isn't required I'm happy to remove it.
>> That does indeed seem really unlikely. Assuming that postgres stays up for 10
>> years without a single restart, it'd be ~59 billion ticks a second.
> Agreed, it's overkill.
>> I don't mind a defense, but I think we'd be better off putting it into
>> ClockSweepTick() or such, simply erroring out if we ever hit this. It's
>> unlikely that we'd get (and keep) all the relevant untested code correct ime.
>> Then we also can assert that prev_strategy_passes <= strategy_passes.
> Added assertions and comments to explain decision.
>> Greetings,
>>
>> Andres Freund

Patch set is now:

1) remove freelist

2) remove buffer_strategy_lock

3) abstract clock-sweep to type and API

Rebased v10 onto 5457ea46d18 at GitHub[3] and CommitFest[4],

-greg

[1] https://stackoverflow.com/a/26047426/366692

[2] https://gmplib.org/~tege/divcnst-pldi94.pdf

[3] https://github.com/gburd/postgres/pull/10

[4] https://commitfest.postgresql.org/patch/5928/

Attachment Content-Type Size
test.c text/x-csrc 10.8 KB
v10-0001-Eliminate-the-freelist-from-the-buffer-manager-a.patch text/x-patch 18.7 KB
v10-0002-Remove-the-buffer_strategy_lock-and-make-the-clo.patch text/x-patch 18.6 KB
v10-0003-Abstract-clock-sweep-buffer-replacement-algorith.patch text/x-patch 7.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2025-07-25 19:31:47 Re: Non-text mode for pg_dumpall
Previous Message Andrey Borodin 2025-07-25 18:33:39 Re: IPC/MultixactCreation on the Standby server