Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From: Greg Burd <greg(at)burd(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date: 2025-07-22 18:43:26
Message-ID: 2c98efe2-412e-4892-92ce-e590be414247@burd.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/21/25 14:35, Andres Freund wrote:
> Hi,
>
> On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
>> On 7/18/25 13:03, Andres Freund wrote:
>> Hello.  Thanks again for taking the time to review the email and patch,
>> I think we're onto something good here.
>>
>>> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
>>> for the have_free_buffer() use in autoprewarm, it's a rather trivial
>>> patch. And I really couldn't measure regressions above the noise level, even
>>> if absurdly extreme use cases.
>> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
>> for keeping the freelist"?
> Err, yes :(

Phew.  :)  No worries.

>>> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>>>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>>>> I think we'll likely need something to replace it.
>>>> Fair, this (v5) patch doesn't yet try to address this.
>>>>
>>>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>>>> right. The goal of the use have_free_buffer() is obviously to stop
>>>>> prewarming
>>>>> shared buffers if doing so would just evict buffers.  But it's not clear
>>>>> to me
>>>>> that we should just stop when there aren't any free buffers - what if the
>>>>> previous buffer contents aren't the right ones?  It'd make more sense to
>>>>> me to
>>>>> stop autoprewarm once NBuffers have been prewarmed...
>>>> I had the same high level reaction, that autoprewarm was leveraging
>>>> something
>>>> convenient but not necessarily required or even correct.  I'd considered
>>>> using
>>>> NBuffers as you describe due to similar intuitions, I'll dig into that idea
>>>> for
>>>> the next revision after I get to know autoprewarm a bit better.
>>> Cool. I do think that'll be good enough.
>> I re-added the have_free_buffer() function only now it returns false
>> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
>> has made its first complete pass.  With that I reverted my changes in
>> the autoprewarm module.  The net should be the same behavior as before
>> at startup when using that module.
> I don't think we should have a have_free_buffer() that doesn't actually test
> whether we have a free buffer, that seems too likely to cause
> misunderstandings down the line. What if we instead just limit the amount of
> buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
> the number of to-be-loaded buffers, so that shouldn't be hard.

I'm glad you said that, I wasn't thrilled with that either and I'm not
sure why I didn't just correct for that in the last patch set.  I'm now
capping num_elements to NBuffers at most.

>>>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>>>> I'll dig into the Windows issues next week as well.
>>> FWIW, there are backtraces generated on windows. E.g.
>>>
>>> https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>>>
>>> 000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
>>> 000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
>>> char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>>> char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>>> int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
>>> 000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
>>> unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
>>> 000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
>>> unsigned int * complete_passes = 0x000000cd`827fdfc0,
>>> unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
>>> 000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
>>> struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
>>> 000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
>>> void * startup_data = 0x00000000`00000000,
>>> unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
>>> 000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
>>> int argc = 0n3,
>>> char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
>>> 000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
>>> int argc = 0n3,
>>> char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>>>
>>> I.e. your new assertion failed for some reason that i can't *immediately* see.
>> I put that in as a precaution and as a way to communicate the intention
>> of the other code above it.  I never imagined it would assert.  I've
>> changed clock_read() to only assert when the modulo differs and left
>> that assert in the calling ClockSweepTick() function because it was
>> redundant and I'm curious to see if we see a similar assert when testing
>> the modulo.
> Do you understand why it triggered? Because I don't immediately. The fact that
> it triggered only on windows, where the compiler is rather different, makes it
> worth understanding imo.

I dug into the ASM for both GCC 15.1 and MSVC 19.latest (thanks
godbolt.org!) for x86_64 and there was a critical difference.  It starts
with the fact that I'd used uint32 for my NBuffersPow2Mask rather than
uint64.  That then translates to two different compiled outputs for
clock_read() (was: clock_modulo()).

gcc-15.1 -O2

clock_read(unsigned long long): and edi, DWORD PTR NBuffersPow2Mask[rip]
mov edx, DWORD PTR NBuffers[rip] mov rax, rdi sub rax, rdx cmp rdi, rdx
cmovb rax, rdi ret

msvc-19.latest /O2

hand$ = 8 unsigned int clock_read(unsigned int64) PROC ; clock_read,
COMDAT mov edx, ecx and rdx, QWORD PTR unsigned int64 NBuffersPow2Mask ;
NBuffersPow2Mask mov ecx, DWORD PTR unsigned int NBuffers ; NBuffers mov
eax, edx sub eax, ecx cmp rdx, rcx cmovb eax, edx ret 0 unsigned int
clock_read(unsigned __int64) ENDP ; clock_read

Here's what I think was happening, the MSVC compiler produced assembly
for "hand & NBuffersPow2Mask" that uses "rdx QWORD" while GCC uses "edi
DWORD".  The 32-bit AND operation (edi) automatically zeros the upper 32
bits of rdi after performing the and with the uint64 value of hand while
"rdx QWORD" does not potentially leaving some of the upper 32 bits set. 
My guess is that on Windows when the value of the clock hand exceeded
UINT32_MAX (as can happen in as little as 3 seconds in a tight loop but
likely took longer in the test run) the bits not masked out would
inflate the resulting value which would be > NBufers and also differ
from the simple modulo calculation causing the failed assertion.

Changing the value of NBuffersPow2Mask to uint64 and using similarly
sized types in these functions more closely aligns the assembly code and
should fix this.

>>>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>>>
>>>> /*
>>>> * Compute strategy_delta = how many buffers have been scanned by the
>>>> - * clock sweep since last time. If first time through, assume none. Then
>>>> - * see if we are still ahead of the clock sweep, and if so, how many
>>>> + * clock-sweep since last time. If first time through, assume none. Then
>>>> + * see if we are still ahead of the clock-sweep, and if so, how many
>>>> * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>>> * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>>> * behavior when the passes counts wrap around.
>>>> */
>>>> if (saved_info_valid)
>>>> {
>>>> - int32 passes_delta = strategy_passes - prev_strategy_passes;
>>>> + int32 passes_delta;
>>>> +
>>>> + if (unlikely(prev_strategy_passes > strategy_passes))
>>>> + {
>>>> + /* wrap-around case */
>>>> + passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>>>> + }
>>>> + else
>>>> + {
>>>> + passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>>>> + }
>>>>
>>>> strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>>> strategy_delta += (long) passes_delta * NBuffers;
>>> That seems somewhat independent of the rest of the change, or am I missing something?
>> That change is there to cover the possibility of someone managing to
>> overflow and wrap a uint64 which is *highly* unlikely.
> That risk existed previously too - I'm not against shoring things up, I'd just
> do it in a precursor commit, to make this easier to review.
>
>> If this degree of paranoia isn't required I'm happy to remove it.
> That does indeed seem really unlikely. Assuming that postgres stays up for 10
> years without a single restart, it'd be ~59 billion ticks a second.
Agreed, it's overkill.
> I don't mind a defense, but I think we'd be better off putting it into
> ClockSweepTick() or such, simply erroring out if we ever hit this. It's
> unlikely that we'd get (and keep) all the relevant untested code correct ime.
> Then we also can assert that prev_strategy_passes <= strategy_passes.
Added assertions and comments to explain decision.
> Greetings,
>
> Andres Freund

rebased onto ce6513e96a1 patch set v7 attached and available on GitHub
[1][2],

-greg

[1] https://github.com/gburd/postgres/pull/9

[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v7

Attachment Content-Type Size
v7-0001-Eliminate-the-freelist-from-the-buffer-manager-an.patch text/x-patch 18.7 KB
v7-0002-Remove-the-buffer_strategy_lock-and-make-the-cloc.patch text/x-patch 20.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2025-07-22 18:44:29 Re: Parallel heap vacuum
Previous Message Masahiko Sawada 2025-07-22 18:32:06 Re: Support getrandom() for pg_strong_random() source