From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Greg Burd <greg(at)burd(dot)me> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock |
Date: | 2025-07-18 17:03:22 |
Message-ID: | v5sxesrt3oq5eprz52327opm7vrvdbgjjw3pck25vl7kmrm74a@nxxxjphaij5g |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
I'd be curious if anybody wants to argue for keeping the clock sweep. Except
for the have_free_buffer() use in autoprewarm, it's a rather trivial
patch. And I really couldn't measure regressions above the noise level, even
if absurdly extreme use cases.
On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
> > I think we'll likely need something to replace it.
>
> Fair, this (v5) patch doesn't yet try to address this.
>
> > TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
> > right. The goal of the use have_free_buffer() is obviously to stop
> > prewarming
> > shared buffers if doing so would just evict buffers. But it's not clear
> > to me
> > that we should just stop when there aren't any free buffers - what if the
> > previous buffer contents aren't the right ones? It'd make more sense to
> > me to
> > stop autoprewarm once NBuffers have been prewarmed...
>
> I had the same high level reaction, that autoprewarm was leveraging
> something
> convenient but not necessarily required or even correct. I'd considered
> using
> NBuffers as you describe due to similar intuitions, I'll dig into that idea
> for
> the next revision after I get to know autoprewarm a bit better.
Cool. I do think that'll be good enough.
> > The most obvious way around this would be to make the clock hand a 64bit
> > atomic, which would avoid the need to have a separate tracker for the
> > number
> > of passes. Unfortunately doing so would require doing a modulo
> > operation each
> > clock tick, which I think would likely be too expensive on common
> > platforms -
> > on small shared_buffers I actually see existing, relatively rarely
> > reached,
> > modulo in ClockSweepTick() show up on a Skylake-X system.
>
> So, this idea came back to me today as I tossed out the union branch and
> started
> over.
>
> a) can't require a power of 2 for NBuffers
> b) would like a power of 2 for NBuffers to make a few things more efficient
> c) a simple uint64 atomic counter would simplify things
>
> The attached (v5) patch takes this approach *and* avoids the modulo you were
> concerned with. My approach is to have nextVictimBuffer as a uint64 that
> only
> increments (and at some point 200 years or so might wrap around, but I
> digress).
> To get the actual "victim" you modulo that, but not with "%" you call
> clock_modulo(). In that function I use a "next power of 2" value rather
> than
> NBuffers to efficiently find the modulo and adjust for the actual value.
> Same
> for completePasses which is now a function clock_passes() that does similar
> trickery and returns the number of times the counter (nextVictimBuffer) has
> "wrapped" around modulo NBuffers.
Yea, that could work! It'd be interesting to see some performance numbers for
this...
> Now that both values exist in the same uint64 it can be the atomic vessel
> that coordinates them, no synchronization problems at all and no requirement
> for the buffer_strategy_lock.
Nice!
> > I think while at it, we should make ClockSweepTick() decrement
> > nextVictimBuffer by atomically subtracting NBuffers, rather than using
> > CAS. I
> > recently noticed that the CAS sometimes has to retry a fair number of
> > times,
> > which in turn makes the `victim % NBuffers` show up in profiles.
>
> In my (v5) patch there is one CAS that increments NBuffers. All other
> operations on NBuffers are atomic reads. The modulo you mention is gone
> entirely, unnecessary AFAICT.
There shouldn't be any CASes needed now, right? Just a fetch-add? The latter
often scales *way* better under contention.
[Looks at the patch ...]
Which I think is true in your patch, I don't see any CAS.
> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
> I'll dig into the Windows issues next week as well.
FWIW, there are backtraces generated on windows. E.g.
000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo(
unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart(
unsigned int * complete_passes = 0x000000cd`827fdfc0,
unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync(
struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain(
void * startup_data = 0x00000000`00000000,
unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9 postgres!main(
int argc = 0n3,
char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
I.e. your new assertion failed for some reason that i can't *immediately* see.
> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>
> /*
> * Compute strategy_delta = how many buffers have been scanned by the
> - * clock sweep since last time. If first time through, assume none. Then
> - * see if we are still ahead of the clock sweep, and if so, how many
> + * clock-sweep since last time. If first time through, assume none. Then
> + * see if we are still ahead of the clock-sweep, and if so, how many
> * buffers we could scan before we'd catch up with it and "lap" it. Note:
> * weird-looking coding of xxx_passes comparisons are to avoid bogus
> * behavior when the passes counts wrap around.
> */
> if (saved_info_valid)
> {
> - int32 passes_delta = strategy_passes - prev_strategy_passes;
> + int32 passes_delta;
> +
> + if (unlikely(prev_strategy_passes > strategy_passes))
> + {
> + /* wrap-around case */
> + passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
> + }
> + else
> + {
> + passes_delta = (int32) (strategy_passes - prev_strategy_passes);
> + }
>
> strategy_delta = strategy_buf_id - prev_strategy_buf_id;
> strategy_delta += (long) passes_delta * NBuffers;
That seems somewhat independent of the rest of the change, or am I missing something?
> +static uint32 NBuffersPow2; /* NBuffers rounded up to the next power of 2 */
> +static uint32 NBuffersPow2Shift; /* Amount to bitshift NBuffers for
> + * division */
For performance in ClockSweepTick() it might more sense to store the mask
(i.e. NBuffersPow2 - 1), rather than the actual power of two.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | vignesh C | 2025-07-18 17:06:15 | Re: Log prefix missing for subscriber log messages received from publisher |
Previous Message | Andres Freund | 2025-07-18 16:46:44 | Re: Adding basic NUMA awareness |