Re: Changing shared_buffers without restart

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Jack Ng <Jack(dot)Ng(at)huawei(dot)com>, Ni Ku <jakkuniku(at)gmail(dot)com>
Subject: Re: Changing shared_buffers without restart
Date: 2025-09-18 04:47:15
Message-ID: CAExHW5s=0Yz7ZhN6J+ORUBEpme-ZkfOC=KaDuD6LHtcwC8Abvg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tomas,
Thanks for your detailed feedback. Sorry for replying late.

On Fri, Jul 4, 2025 at 5:36 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> v5-0008-Support-shrinking-shared-buffers.patch
>
> 1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't
> there a place starting it and waiting for it to complete? Why
> shouldn't it do EvictExtraBuffers itself?
>
> 3) Seems a bit strange to do it from a random backend. Shouldn't it
> be the responsibility of a process like checkpointer/bgwriter, or
> maybe a dedicated dynamic bgworker? Can we even rely on a backend
> to be available?

I will answer these two together. I don't think we should rely on a
random backend. But that's what the rest of the patches did and
patches to support shrinking followed them. But AFAIK, Dmitry is
working on a set of changes which will make a non-postmaster backend
to be a coordinator for buffer pool resizing process. When that
happens the same backend which initializes the expanded memory when
expanding the buffer pool should also be responsible for evicting the
buffers when shrinking the buffer pool. Will wait for Dmitry's next
set of patches before making this change.

>
> 4) Unsolved issues with buffers pinned for a long time. Could be an
> issue if the buffer is pinned indefinitely (e.g. cursor in idle
> connection), and the resizing blocks some activity (new connections
> or stuff like that).

In such cases we should cancel the operation or kill that backend (per
user preference) after a timeout with (user specified) timeout >= 0.
We haven't yet figured out the details. I think the first version of
the feature would just cancel the operation, if it encounters a pinned
buffer.

> 2) Isn't the change to BufferManagerShmemInit wrong? How do we know the
> last buffer is still at the end of the freelist? Seems unlikely.
> 6) It's not clear to me in what situations this triggers (in the call
> to BufferManagerShmemInit)
>
> if (FirstBufferToInit < NBuffers) ...
>

Will answer these two together. As the comment says FirstBufferToInit
< NBuffers indicates two situations: When FirstBufferToInit = 0, it's
the first time the buffer pool is being initialized. Otherwise it
indicates expanding the buffer pool, in which case the last buffer
will be a newly initialized buffer. All newly initialized buffers are
linked into the freelist one after the other in the increasing order
of their buffer ids by code a few lines above. Now that the free
buffer list has been removed, we don't need to worry about it. In the
next set of patches, I have removed this code.

>
> v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
>
> 1) IMHO this should be included in the earlier resize/shrink patches,
> I don't see a reason to keep it separate (assuming this is the
> correct way, and the "init" is not).

These patches are separate just because me and Dmitry developed them
respectively. Once they are reviewed by Dmitry, we will squash them
into a single patch. I am expecting that Dmitry's next patchset which
will do significant changes to the synchronization will have a single
patch for all code related to and consequential to resizing.

>
> 5) Funny that "AI suggests" something, but doesn't the block fail to
> reset nextVictimBuffer of the clocksweep? It may point to a buffer
> we're removing, and it'll be invalid, no?
>

The TODO no more applies. There's code to reset the clocksweep in a
separate patch. Sorry for not removing it earlier. It will be removed
in the next set of patches.

>
> 2) Doesn't StrategyPurgeFreeList already do some of this for the case
> of shrinking memory?
>
> 3) Not great adding a bunch of static variables to bufmgr.c. Why do we
> need to make "everything" static global? Isn't it enough to make
> only the "valid" flag global? The rest can stay local, no?
>
> If everything needs to be global for some reason, could we at least
> make it a struct, to group the fields, not just separate random
> variables? And maybe at the top, not half-way throught the file?
>
> 4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting
> anything, it's just invalidating the info about past runs.

I think there's a bit of refactoring possible here. Setting up the
BgBufferSync state, resetting it when bgwriter_lru_maxpages <= 0 and
then re initializing it when bgwriter_lru_maxpages > 0, and actually
performing the buffer sync is all packed into the same function
BgBufferSync() right now. It makes this function harder to read. I
think these functionalities should be separated into their own
functions and use the appropriate one instead of BgBufferSyncAdjust(),
whose name is misleading. The static global variables should all be
packed into a structure which is passed as an argument to these
functions. I need more time to study the code and refactor it that
way. For now I have added a note to the commit message of this patch
so that I will revisit it. I have renamed BgBufferSyncAdjust() to
BgBufferSyncReset().

>
> 5) I don't quite understand why BufferSync needs to do the dance with
> delay_shmem_resize. I mean, we certainly should not run BufferSync
> from the code that resizes buffers, right? Certainly not after the
> eviction, from the part that actually rebuilds shmem structs etc.

Right. But let me answer all three questions together.

> So perhaps something could trigger resize while we're running the
> BufferSync()? Isn't that a bit strange? If this flag is needed, it
> seems more like a band-aid for some issue in the architecture.
>
> 6) Also, why should it be fine to get into situation that some of the
> buffers might not be valid, during shrinking? I mean, why should
> this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers).
> It seems better to ensure we never get into "sync" in a way that
> might lead some of the buffers invalid. Seems way too lowlevel to
> care about whether resize is happening.
>
> 7) I don't understand the new condition for "Execute the LRU scan".
> Won't this stop LRU scan even in cases when we want it to happen?
> Don't we want to scan the buffers in the remaining part (after
> shrinking), for example? Also, we already checked this shmem flag at
> the beginning of the function - sure, it could change (if some other
> process modifies it), but does that make sense? Wouldn't it cause
> problems if it can change at an arbitrary point while running the
> BufferSync? IMHO just another sign it may not make sense to allow
> this, i.e. buffer sync should not run during the "actual" resize.
>

ProcessBarrierShmemResize() which does the resizing is part of
ProcessProcSignalBarrier() which in turn gets called from
CHECK_FOR_INTERRUPTS(), which is called from multiple places, even
from elog(). I am not able to find a call stack linking BgBufferSync()
and ProcessProcSignalBarrier(). But I couldn't convince myself that it
is true and will remain true in the future. I mean, the function loops
through a large number of buffers and performs IO, both avenues to
call CHECK_FOR_INTERRUPTS(). Hence that flag. Do you know what (part
of code) guarantees that ProcessProcSignalBarrier() will never be
called from BgBufferSync()?

Note, resizing can not begin till delay_shmem_resize is cleared, so
while BgBufferSync is executing, no buffer can be invalidated or no
new buffers could be added. But at the cost of all other backends to
wait till BgBufferSync finishes. We want to avoid that. The idea here
is to make BgBufferSync stop as soon as it realises that the buffer
resizing is "about to begin". But I think the condition looks wrong. I
think the right condition would be NBufferPending != NBuffers or
NBuffersOld. AFAIK, Dmitry is working on consolidating NBuffers*
variables as you have requested elsewhere. Better even if we could
somehow set a flag in shared memory indicating that the buffer
resizing is "about to begin" and BgBufferSync() checks that flag. So I
will wait for him to make that change and then change this condition.

>
> v5-0010-Additional-validation-for-buffer-in-the-ring.patch
>
> 1) So the problem is we might create a ring before shrinking shared
> buffers, and then GetBufferFromRing will see bogus buffers? OK, but
> we should be more careful with these checks, otherwise we'll miss
> real issues when we incorrectly get an invalid buffer. Can't the
> backends do this only when they for sure know we did shrink the
> shared buffers? Or maybe even handle that during the barrier?
>
> 2) IMHO a sign there's the "transitions" between different NBuffers
> values may not be clear enough, and we're allowing stuff to happen
> in the "blurry" area. I think that's likely to cause bugs (it did
> cause issues for the online checksums patch, I think).
>

I think you are right, that this might hide some bugs. Just like we
remove buffers to be shrunk from freelist only once, I wanted each
backend to remove them buffer rings only once. But I couldn't find a
way to make all the buffer rings for a given backend available to the
barrier handling code. The rings are stored in Scan objects, which
seem local to the executor nodes. Is there a way to make them
available to barrier handling code (even if it has to walk an
execution tree, let's say)?

If there would have been only one scan, we could have set a flag after
shrinking, let GetBufferFromRing() purge all invalid buffers once when
flag is true and reset the flag. But there can be more than one scan
happening and we don't know how many there are and when all of them
have finished calling GetBufferFromRing() after shrinking. Suggestions
to do this only once are welcome.

I will send the next set of patches with my next email.

--
Best Wishes,
Ashutosh Bapat

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-09-18 04:52:35 Re: Reword messages using "as" instead of "because"
Previous Message Peter Smith 2025-09-18 04:46:17 Re: [WIP]Vertical Clustered Index (columnar store extension) - take2