| From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me>, Peter Eisentraut <peter(at)eisentraut(dot)org> |
| Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, chaturvedipalak1911(at)gmail(dot)com, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Changing shared_buffers without restart |
| Date: | 2026-01-28 13:19:08 |
| Message-ID: | CAExHW5s8s=UhjqNa_Tz1PFCRLzt3=5nvd5vD1wFdKWMQCmFySQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Nov 14, 2025 at 5:23 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> Hi,
> PFA new patchset with some TODOs from previous email addressed:
>
> On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > 1. New backends join while the synchronization is going on.
>
> Done. Explained the solution below in detail.
>
> > An existing backend exiting.
>
> Not tested specifically, but should work.
>
> > 2. Failure or crash in the backend which is executing pg_resize_buffer_pool()
>
> still a TODO
>
> > 3. Fix crashes in the tests.
>
> core regression passes, pg_buffercache regression tests pass, and the
> tests for buffer resizing pass most of the time. So far I have seen
> three issues:
> 1. An assertion failure from an AIO worker - it happened only once and
> I couldn't reproduce it again. Need to study the interaction of AIO
> workers with buffer resizing.
> 2. Checkpointer crashes - which is one of the TODOs listed below.
> 3. A shared-memory-id related failure, which I don't understand but
> which happens more frequently than the first one. Need to look into
> that.
>
> > go through Tomas's detailed comments and address those
> > which still apply.
>
> Still a TODO. But since many of those patches are revised heavily, I
> think many of the comments may have been addressed, some may not apply
> anymore.
Thanks a lot Tomas for these review comments and the summary of the
discussion as of your response. Sorry it took so long to fully respond
to this email, but I wanted the patches to be in reasonable shape
before responding. It's been a while since I posted the last patchset,
hence I am attaching a patchset along with responses to your comments
as well as the earlier comments that you mentioned in your response.
Dmitry has responded to some parts of these emails already, but I am
responding to all the comments in the context of the latest
implementation, which has been revised heavily since his responses.
The latest design and UI are described in detail in [1]. Please note
that that email also responds to some earlier comments from Andres and
others; I have not repeated those responses here.
The earlier patches did not build in EXEC_BACKEND mode. The attached
patches build in EXEC_BACKEND mode, and the regression tests pass in
that build. I haven't tried running the whole test suite, though.
>
> I agree it'd be useful to be able to resize shared buffers, without
> having to restart the instance (which is obviously very disruptive). So
> if we can make this work reliably, with reasonable trade offs (both on
> the backends, and also the risks/complexity introduced by the feature).
Agreed. That's the goal - to make it work reliably with reasonable
trade-offs. Adding some details to this goal as follows:
1. Performance
--------------
a. The impact of resizing on the performance of the concurrent
database operations should be reasonable and acceptable.
b. The code changes should not cause significant performance
degradation during normal operations
Palak Chaturvedi has already done some benchmarking. I am listing her
findings here
1. The code changes do not have any noticeable performance degradation
during normal operations.
2. Resizing itself is also reasonably fast and with very minimal,
almost unnoticeable, performance impact on the concurrent
transactions.
We will share the benchmarks once the code is cleaner, more complete and stable.
2. Reliability
--------------
We are adding tests to make sure that resizing does not introduce any
crashes, hazards or data corruption. Some tests are already part of
the patch set.
>
> I'm far from an expert on mmap() and similar low-level stuff, but the
> current approach (reserving a big chunk of shared memory and slicing
> it by mmap() into smaller segments) seems reasonable.
In the latest patches, we reserve address space with mmap() and manage
the memory within that address space using file-backed memory.
>
> But I'm getting a bit lost in how exactly this interacts with things
> like overcommit, system memory accounting / OOM killer and this sort of
> stuff. I went through the thread and it seems to me the reserve+map
> approach works OK in this regard (and the messages on linux-mm seem to
> confirm this). But this information is scattered over many messages and
> it's hard to say for sure, because some of this might be relevant for
> an earlier approach, or a subtly different variant of it.
Looking at the documentation and the messages on linux-mm, it seems
that the mmap() with MAP_NORESERVE approach works well with
overcommit, system memory accounting and the OOM killer. I plan to add
tests to verify these aspects, so that future changes do not cause any
regressions. I am looking for a way to write a small test program for
this, which we can add to our test battery, but have had no success so
far. Any ideas?
>
> A similar question is portability. The comments and commit messages
> seem to suggest most of this is linux-specific, and other platforms just
> don't have these capabilities. But there's a bunch of messages (mostly
> by Thomas Munro) that hint FreeBSD might be capable of this too, even if
> to some limited extent. And possibly even Windows/EXEC_BACKEND, although
> that seems much trickier.
>
> FWIW I think it's perfectly fine to only support resizing on selected
> platforms, especially considering Linux is the most widely used system
> for running Postgres. We still need to be able to build/run on other
> systems, of course. And maybe it'd be good to be able to disable this
> even on Linux, if that eliminates some overhead and/or risks for people
> who don't need the feature. Just a thought.
+1. The plan is to make it work on Linux first and disable it on other
platforms. I think it's a good idea to be able to disable it, say at
initdb time, but I am not yet sure whether that would avoid any
overhead or risks. I haven't yet wrapped my head around the platform
dependency completely.
>
> Anyway, my main point is that this information is important, but very
> scattered over the thread. It's a bit foolish to expect everyone who
> wants to do a review to read the whole thread (which will inevitably
> grow longer over time), and assemble all these pieces again and again,
> following all the changes in the design etc. Few people will get over
> that hurdle, IMHO.
>
> So I think it'd be very helpful to write a README, explaining the
> current design/approach, and summarizing all these aspects in a single
> place. Including things like portability, interaction with the OS
> accounting, OOM killer, this kind of stuff. Some of this stuff may be
> already mentioned in code comments, but it's hard to find those.
>
> Especially worth documenting are the states the processes need to go
> through (using the barriers), and the transitions between them (i.e.
> what is allowed in each phase, what blocks can be visible, etc.).
>
That's a great idea. There is already a README in
src/backend/storage/buffer/; I have added a section on buffer pool
resizing. The section has pointers to relevant code comments. I will
revise it to be as self-sufficient as possible as the patches mature.
The code managing the shared memory is scattered across
backend/portability, storage/ipc etc., so it's somewhat difficult to
find one good place for a README which explains everything.
>
> I'll go over some higher-level items first, and then over some comments
> for individual patches.
>
>
> 1) no user docs
>
> There are no user .sgml docs, and maybe it's time to write some,
> explaining how to use this thing - how to configure it, how to trigger
> the resizing, etc. It took me a while to realize I need to do ALTER
> SYSTEM + pg_reload_conf() to kick this off.
Latest patches have updated config.sgml and func-admin.sgml to
document the functionality. Please review them.
>
> It should also document the user-visible limitations, e.g. what activity
> is blocked during the resizing, etc.
>
We are still figuring this out, but we will document the limitations
in func-admin.sgml. Right now there are no known limitations, and my
intention is to keep them as few as possible.
>
> 2) pending GUC changes
>
> I'm somewhat skeptical about the GUC approach. I don't think it was
> designed with this kind of use case in mind, and so I think it's quite
> likely it won't be able to handle it well.
>
> For example, there's almost no validation of the values, so how do you
> ensure the new value makes sense? Because if it doesn't, it can easily
> crash the system (I've seen such crashes repeatedly, I'll get to that).
> Sure, you may do ALTER SYSTEM to set shared_buffers to nonsense and it
> won't start after restart/reboot, but crashing an instance is maybe a
> little bit more annoying.
>
> Let's say we did the ALTER SYSTEM + pg_reload_conf(), and it gets stuck
> waiting on something (can't evict a buffer or something). How do you
> cancel it, when the change is already written to the .auto.conf file?
> Can you simply do ALTER SYSTEM + pg_reload_conf() again?
>
> It also seems a bit strange that the "switch" gets to be driven by a
> randomly selected backend (unless I'm misunderstanding this bit). It
> seems to be true for the buffer eviction during shrinking, at least.
>
> Perhaps this should be a separate utility command, or maybe even just
> a new ALTER SYSTEM variant? Or even just a function, similar to what
> the "online checksums" patch did, possibly combined with a bgworder
> (but probably not needed, there are no db-specific tasks to do).
>
As documented in func-admin.sgml and config.sgml, resizing the buffer
pool requires ALTER SYSTEM + pg_reload_conf() followed by
pg_resize_shared_buffers(). There is no pending flag and no arbitrary
backend driving the process. Validation and failure handling are done
by pg_resize_shared_buffers(). The backend in which
pg_resize_shared_buffers() is called is responsible for driving the
resizing process. I think the new implementation takes care of all the
concerns mentioned above.
>
> 3) max_available_memory
>
> Speaking of GUCs, I dislike how max_available_memory works. It seems a
> bit backwards to me. I mean, we're specifying shared_buffers (and some
> other parameters), and the system calculates the amount of shared memory
> needed. But the limit determines the total limit?
>
> I think the GUC should specify the maximum shared_buffers we want to
> allow, and then we'd work out the total to pre-allocate? Considering
> we're only allowing to resize shared_buffers, that should be pretty
> trivial. Yes, it might happen that the "total limit" happens to exceed
> the available memory or something, but we already have the problem
> with shared_buffers. Seems fine if we explain this in the docs, and
> perhaps print the calculated memory limit on start.
>
> In any case, we should not allow setting a value that ends up
> overflowing the internal reserved space. It's true we don't have a good
> way to do checks for GUCs, but it's a bit silly to crash because of
> hitting some non-obvious internal limit that we necessarily know about.
>
> Maybe this is a reason why GUC hooks are not a good way to set this.
Latest patches introduce max_shared_buffers which specifies the
maximum shared buffers allowed. max_available_memory, that some
previous patchsets had, has been removed. This GUC is mentioned in the
documentation changes.
One relatively small thing to think about is what we call the
shared_buffers GUC - it's not SIGHUP per se, since it requires a
function to be called after the reload, but it's not POSTMASTER
either, since a restart is not required. I think we need a new PGC_
context for it, but I haven't yet thought of a good name. Do you have
any suggestions?
>
>
> 4) SHMEM_RESIZE_RATIO
>
> The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way
> these ratios can make sense. For example, BLCKSZ is 8192 but the buffer
> descriptor is 64B. That's 128x difference, but the ratios says 0.6 and
> 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and
> the rest is only "reserved".
>
> However, that just makes the max_available_memory a bit misleading,
> because you can't ever use it. You can use the 60% for shared buffers
> (which is not mentioned anywhere, and good luck not overflowing that,
> as it's never checked), but those smaller regions are guaranteed to be
> mostly unused. Unfortunate.
>
> And it's not just a matter of fixing those ratios, because then someone
> rebuilds with 32kB blocks and you're in the same situation.
>
> Moreover, all of the above is for mappings sized based on NBuffers. But
> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the
> moment someone increases of max_connection, max_locks_per_transaction
> and possibly some other stuff?
>
+1. Fixed. max_shared_buffers only deals with the shared buffers.
>
> 5) no tests
>
> I mentioned no "user docs", but the patch has 0 tests too. Which seems
> a bit strange for a patch of this age.
>
> A really serious part of the patch series seems to be the coordination
> of processes when going through the phases, enforced by the barriers.
> This seems like a perfect match for testing using injection points, and
> I know we did something like this in the online checksums patch, which
> needs to coordinate processes in a similar way.
>
> But even just a simple TAP test that does a bunch of (random?) resizes
> while running a pgbench seem better than no tests. (That's what I did
> manually, and it crashed right away.)
>
> There's a lot more stuff to test here, I think. Idle sessions with
> buffers pinned by open cursors, multiple backends doing ALTER SYSTEM
> + pg_reload_conf concurrently, other kinds of failures.
>
+1. The latest patches already have a few tests but we are adding more
tests to cover different scenarios.
>
> 6) SIGBUS failures
>
> As mentioned, I did some simple tests with shrink/resize with a pgbench
> in the background, and it almost immediately crashed for me :-( With a
> SIGBUS, which I think is fairly rare on x86 (definitely much less common
> than e.g. SIGSEGV).
>
> An example backtrace attached.
>
There is a test which runs pgbench concurrently with resizing, and it
fails much less frequently now (1 in 20 runs or fewer). Quite likely
the crash you saw has also been fixed. Please let me know if the test
fails for you, or if there is something in your test that is not
covered by that test.
>
> 7) EXEC_BACKEND, FreeBSD
>
> We clearly need to keep this working on systems without the necessary
> bits (so likely EXEC_BACKEND, FreeBSD etc.). But the builds currently
> fail in both cases, it seems.
>
> I think it's fine to not support resizing on every platform, then we'd
> never get it, but it still needs to build. It would be good to not have
> two very different code versions, one for resizing and one without it,
> though. I wonder if we can just have the "no-resize" use the same struct
> (with the segments/mapping, ...) and all that, but skipping the space
> reservation.
>
I have not thought fully through support on platforms other than
Linux. However, my current plan is to not support GUC
max_shared_buffers and the resizing function
pg_resize_shared_buffers() on platforms which do not support the
necessary bits. The code will still build and run on those platforms,
but resizing shared buffers without restart won't be possible.
>
> 8) monitoring
>
> So, let's say I start a resize of shared buffers. How will I know what
> it's currently doing, how much longer it might take, what it's waiting
> for, etc.? I think it'd be good to have progress monitoring, through
> the regular system view (e.g. pg_stat_shmem_resize_progress?).
>
When pg_resize_shared_buffers() finishes, the user knows that the
resizing is finished. It usually takes only a few seconds, so I am not
sure whether we will need progress reporting. But there might be cases
where resizing has to wait for something, or a given phase, e.g.
eviction, takes longer, so we may require progress reporting after
all. I have added a TODO in the code for this.
>
> 10) what to do about stuck resize?
>
> AFAICS the resize can get stuck for various reasons, e.g. because it
> can't evict pinned buffers, possibly indefinitely. Not great, it's not
> clear to me if there's a way out (canceling the resize) after a timeout,
> or something like that? Not great to start an "online resize" only to
> get stuck with all activity blocked for an indefinite amount of time, and
> get to restart anyway.
>
> Seems related to Thomas' message [2], but AFAICS the patch does not do
> anything about this yet, right? What's the plan here?
>
pg_resize_shared_buffers() is affected by timeouts like
statement_timeout as well as query cancellation. However, the current
implementation lacks graceful handling of these events. Another TODO.
>
> 11) preparatory actions?
>
> Even if it doesn't get stuck, some of the actions can take a while, like
> evicting dirty buffers before shrinking, etc. This is similar to what
> happens on restart, when the shutdown checkpoint can take a while, while
> the system is (partly) unavailable.
>
> The common mitigation is to do an explicit checkpoint right before the
> restart, to make the shutdown checkpoint cheap. Could we do something
> similar for the shrinking, e.g. flush buffers from the part to be
> removed before actually starting the resize?
>
Eviction is carried out before any changes to the memory or the shared
buffer metadata. If eviction fails, the resizing is rolled back and
the system continues to work with the old buffer pool size. A test
specifically covering this aspect remains to be added.
>
> 12) does this affect e.g. fork() costs?
>
> I wonder if this affects the cost of fork() in some undesirable way?
> Could it make fork() measurably more expensive?
>
A new backend needs to attach to the extra shared memory segments,
AFAIK, and detach from them when exiting. But I don't think there's
any other extra work that a backend needs to do when starting or
exiting. Windows might be different. I haven't seen any noticeable
difference in pgbench performance with a benchmark which uses a new
connection for every transaction. However, we haven't tried a
benchmark with empty transactions, which would exercise fork() and
exit() at a higher frequency. Another TODO.
>
> 13) resize "state" is all over the place
>
> For me, a big hurdle when reasoning about the resizing correctness is
> that there's quite a lot of distinct pieces tracking what the current
> "state" is. I mean, there's:
>
> - ShmemCtrl->NSharedBuffers
> - NBuffers
> - NBuffersOld
> - NBuffersPending
> - ... (I'm sure I missed something)
>
> There's no cohesive description how this fits together, it seems a bit
> "ad hoc". Could be correct, but I find it hard to reason about.
>
The next set of patches will consolidate the state in only two places:
NBuffersPending and StrategyControl (which needs a new name). Even in
the attached patches it is consolidated in NBuffersPending, ShmemCtrl
and StrategyControl. There are remaining uses of NBuffers which will
be replaced by variables from ShmemCtrl or StrategyControl.
>
> 14) interesting messages from the thread
>
> While reading through the thread, I noticed a couple messages that I
> think are still relevant:
>
> - I see Peter E posted some review in 2024/11 [3], but it seems his
> comments were mostly ignored. I agree with most of them.
A detailed reply to that email is at the end.
>
> - Robert mentioned a couple interesting failure scenarios in [4], not
> sure if all of this was handled. He however assumes pointers would
> not be stable (and that's something we should not allow, and the
> current approach works OK in this regard, I think). He also outlines
> how it'd happen in phases - this would be useful for the design README
> I think. It also reminds me the "phases" in the checksums patch.
>
Pointers are stable in the latest patches. The phases are explained in
the prologue of pg_resize_shared_buffers().
> - Robert asked [5] if Linux might abruptly break this, but I find that
> unlikely. We'd point out we rely on this, and they'd likely rethink.
> This would be made safer if this was specified by POSIX - taking that
> away once implemented seems way harder than for custom extensions.
> It's likely they'd not take away the feature without an alternative
> way to achieve the same effect, I think (yes, harder to maintain).
> Tom suggests [7] this is not in POSIX.
>
> - Matthias mentioned [6] similar flags on other operating systems. Could
> some of those be used to implement the same resizing?
Part of portability TODO.
>
> - Andres had an interesting comment about how overcommit interacts with
> MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit
> accounting. There's also some comments about from linux-mm people [9].
>
New implementation uses MAP_NORESERVE. See my earlier response about overcommit.
> - There seem to be some issues with releasing memory backing a mapping
> with hugetlb [10]. With the fd (and truncating the file), this seems
> to release the memory, but it's linux-specific? But most of this stuff
> is specific to linux, it seems. So is this a problem? With this it
> should be working even for hugetlb ...
>
Right. mmap() + ftruncate(), instead of mmap() + mremap(), allows us
to avoid going through postmaster, which makes the implementation much
simpler, and it also supports hugetlb. But it will be Linux-only.
> - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and
> make the hugetlb stuff work just like on Linux? Unclear. Also, I
> thought the mfd stuff is linux-specific ... or am I confused?
>
Portability TODO.
> - Andres objected to any approach without pointer stability, and I agree
> with that. If we can figure out such solution, of course.
Since we call mmap() only once, the address of the mapping does not
change; it is always stable. So pointer stability is guaranteed.
>
> - Thomas asked [13] why we need to stop all the backends, instead of
> just waiting for them to acknowledge the new (smaller) NBuffers value
> and then let them continue. I also don't quite see why this should
> not work, and it'd limit the disruption when we have to wait for
> eviction of buffers pinned by paused cursors, etc.
>
The approach in the latest patches does not stop all backends.
Backends continue to work while resizing is in progress; they are
synchronized at certain points using barriers. The details of the
synchronization for each subsystem and worker backend still need to be
worked out. We are working on that.
>
>
> Now, some comments about the individual patches (some of this may be a
> bit redundant with the earlier points):
Since these patches have been heavily rewritten, because we have
redesigned and reimplemented the resizing process, some of the
comments may no longer be relevant. I will try to address the
underlying concerns wherever applicable.
>
>
> v5-0001-Process-config-reload-in-AIO-workers.patch
>
> 1) Hmmm, so which other workers may need such explicit handling? Do all
> other processes participate in procsignal stuff, or does anything
> need an explicit handling?
>
As Andres mentioned in [2], this is not needed. It's not part of the
latest patches.
>
> v5-0003-Introduce-pss_barrierReceivedGeneration.patch
>
> 1) Do we actually need this? Isn't it enough to just have two barriers?
> Or a barrier + condition variable, or something like that.
>
> 2) The comment talks about "coordinated way" when processing messages,
> but it's not very clear to me. It should explain what is needed and
> not possible with the current barrier code.
>
> 3) This very much reminds me what the online checksums patch needed to
> do, and we managed to do it using plain barriers. So why does this
> need this new thing? (No opinion on whether it's correct.)
>
This isn't required in the latest implementation. Removed from the
latest patchset.
>
> v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
>
> 1) "int shmem_segment" - wouldn't it be better to have a separate enum
> for this? I mean, we'll have a predefined list of segments, right?
>
+1. Andres suggested [2] keeping only two segments, so an enum may be
superfluous, but it may be good if we need more segments in the
future. A TODO for now.
> 2) typedef struct AnonymousMapping would deserve some comment
This structure no longer exists. Instead we use MemoryMappingSizes to
hold the required sizes of the shared memory segments.
>
> 3) ANON_MAPPINGS - Probably should be MAX_ANON_MAPPINGS? But we'll know
> how many we have, so why not to allocate exactly the right number?
> Or even just an array of structs, like in similar cases?
>
+1. Renamed to NUM_MEMORY_MAPPINGS, which is used to declare and
traverse the corresponding arrays.
> 4) static int next_free_segment = 0;
>
> We exactly know what segments we'll create and in which order, no? So
> why do we even bother with this next_free_segment thing? Can't we
> simply declare an array of AnonymousMapping elements, with all the
> elements, and then just walk it and calculate the sizes/pointers?
next_free_segment is removed from the latest patches as it's not needed.
>
> 5) I'm a bit confused about the segment/mapping difference. The patch
> seems to randomly mix those, or maybe I'm just confused. I mean,
> we are creating just shmem segment, and the pieces are mappings,
> right? So why do we index them by "shmem_segment"?
>
> Also, consider
>
> CreateAnonymousSegment(AnonymousMapping *mapping)
>
> so is that creating a segment or mapping? Or what's the difference?
>
> Or are we creating multiple segments, and I missed that? Or are there
> different "segment" concepts, or what?
>
> 6) There should probably be some sort of API wrapping the mappings, so
> that the various places don't need to mess with next_free_segments
> directly, etc. Perhaps PGSharedMemoryCreate() shouldn't do this, and
> should just pass size to CreateAnonymousSegment(), and that finding
> empty slot in Mappings, etc.? Not sure that'll work, but it's a bit
> error-prone if a struct is modified from multiple places like this.
Fixed this confusion in the latest patches. There are multiple
segments, each mapped to a different address space. Since we are using
mmap() + ftruncate(), the memory allocated for each segment is
controlled by a separate fd. If we used a single address space
reservation and carved multiple segments out of it, we could not
resize each segment separately. Just to clarify, we do not reserve one
large part of the address space and then carve it into smaller
segments.
>
> 7) We should remember which segments got to use huge pages and which
> did not. And we should make it optional for each segment. Although,
> maybe I'm just confused about the "segment" definition - if we only
> have one, that's where huge pages are applied.
>
> If we could have multiple segments for different segments (whatever
> that means), not sure what we'll report for cases when some segments
> get to use huge pages and others don't. Either because we don't want
> to use that for some segments, or because we happen to run out of
> the available huge pages.
This is an interesting idea. In the current implementation either all
segments use huge pages or none of them do. I think per-segment huge
page usage could be an add-on feature in the next version. What do you
think? Using separate segments reserving separate address spaces will
make it easy to use huge pages for some and not for others.
>
> 8) It seems PGSharedMemoryDetach got some significant changes, but the
> comment was not modified at all. I'd guess that means the comment is
> perhaps stale, or maybe there's something we should mention.
>
Done.
> 9) I doubt the Assert on GetConfigOption needs to be repeated for all
> segments (in CreateSharedMemoryAndSemaphores).
>
Done.
> 10) Why do we have the Mapping and Segments indexed in different ways?
> I mean, Mappings seem to be filled in FIFO (just grab the next free
> slot), while Segments are indexed by segment ID.
>
Fixed this confusion in the latest patches. Both are indexed by segment ID.
> 11) Actually, what's the difference between the contents of Mappings
> and Segments? Isn't that the same thing, indexed in the same way?
> Or could it be unified? Or are they conceptually different thing?
>
See explanation above.
> 12) I believe we'll have a predefined list of segments, with fixed IDs,
> so why not just have a MAX of those IDs as the capacity?
>
Yes. Fixed in the latest patches.
> 13) Would it be good to have some checks on shmem_segment values? That
> it's valid with respect to defined segments, etc. An assert, maybe?
> What about some asserts on the Mapping/Segment elements? To check
> that the element is sensible, and that the arrays "match" (if we
> need both).
>
That's a good idea. Added Asserts to that effect.
> 14) Some of the lines got pretty long, e.g. in pg_get_shmem_allocations.
> I suggest we define some macros to make this shorter, or something
> like that.
>
Done.
> 15) I'd maybe rename ShmemSegment to PGShmemSegment, for consistency
> with PGShmemHeader?
Actually, we track information about the shared memory segments in
multiple places. shmem.c has APIs similar to MemoryContext, but for
shared memory; the information there is consolidated into the
ShmemSegment structure. pg_shmem.h has platform-independent APIs to
manage shared memory segments, and then each implementation has
implementation-specific information. Some of that information is
passed from the parent process (postmaster) to the child processes
(backends). I have created PGInhShmemSegment for the information that
is passed from postmaster to backends, and the AnonymousShmemSegment
structure for tracking information about anonymous shared memory in
sysv_shmem.c. I don't like the name PGInhShmemSegment, but I haven't
figured out a better one. I may rearrange these structures a bit more
in the next patches.
>
> 16) Is MAIN_SHMEM_SEGMENT something we want to expose in a public header
> file? Seems very much like an internal thing, people should access
> it only through APIs ...
>
If we convert it to an enum, we can't avoid exposing it in a public
header file. In the future, we may allow users to allocate memory in
specific segments, so I think it's fine to expose it.
>
> v5-0005-Address-space-reservation-for-shared-memory.patch
>
> 1) Shouldn't reserved_offset and huge_pages_on really be in the segment
> info? Or maybe even in mapping info? (again, maybe I'm confused
> about what these structs store)
I can't find reserved_offset in the latest patches. huge_pages_on is
now a global flag: either all segments use huge pages or none of them
do. As mentioned earlier, per-segment huge page usage may be a
separate add-on feature.
>
> 2) CreateSharedMemoryAndSemaphores comment is rather light on what it
> does, considering it now reserves space and then carves is into
> segments.
>
The detailed comment is in PGSharedMemoryCreate(), which seems to be a
better place for explaining memory reservation etc. I just changed the
comment here to mention the plural shared memory segments and to
differentiate between the shared memory segments and the shared memory
structures. Please let me know if this looks good.
> 3) So ReserveAnonymousMemory is what makes decisions about huge pages,
> for the whole reserved space / all segments in it. That's a bit
> unfortunate with respect to the desirability of some segments
> benefiting from huge pages and others not. Maybe we should have two
> "reserved" areas, one with huge pages, one without?
>
See my responses above about mapping vs segment and per segment huge page usage.
> I guess we don't want too many segments, because that might make
> fork() more expensive, etc. Just guessing, though. Also, how would
> this work with threading?
See my earlier response about fork() performance and about limiting
the number of segments to just 2. If we find that fork() is getting
expensive, we can limit the number of segments to 1.
>
> 4) Any particular reason to define max_available_memory as
> GUC_UNIT_BLOCKS and not GUC_UNIT_MB? Of course, if we change this
> to have "max shared buffers limit" then it'd make sense to use
> blocks, but "total limit" is not in blocks.
>
Right. See response about max_shared_buffers above.
> 5) The general approach seems sound to me, but I'm not expert on this.
> I wonder how portable this behavior is. I mean, will it work on other
> Unix systems / Windows? Is it POSIX or Linux extension?
See my earlier response about portability.
>
> 6) It might be a good idea to have Assert procedures to check mappings
> and segments (that it doesn't overflow reserved space, etc.). It
> took me ages to realize I can change shared_buffers to >60% of the
> limit, it'll happily oblige and then just crash with OOM when
> calling mprotect().
>
This shouldn't happen with the latest patches; there are tests for it.
Please let me know if you still hit the issue in your tests.
>
> v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
>
> 1) I suspect the SHMEM_RESIZE_RATIO is the wrong direction, because it
> entirely ignores relationships between the parts. See the earlier
> comment about this.
>
> 2) In fact, what happens if the user tries to resize to a value that is
> too large for one of the segments? How would the system know before
> starting the resize (and failing)?
>
+1. No SHMEM_RESIZE_RATIO in the latest patches. Reservation is
entirely based on max_shared_buffers.
> 3) It seems wrong to modify the BufferManagerShmemSize like this. It's
> probably better to have a "...SegmentSize" function for individual
> segments, and let BufferManagerShmemSize() to still return a sum of
> all segments.
>
I have reimplemented CalculateShmemSize() in the latest patches. Please review.
> 4) I think MaxAvailableMemory is the wrong abstraction, because that's
> not what people specify. See earlier comment.
>
+1. No MaxAvailableMemory in the latest patches.
> 5) Let's say we change the shared memory size (ALTER SYSTEM), trigger
> the config reload (pg_reload_conf). But then we find that we can't
> actually shrink the buffers, for some unpredictable reason (e.g.
> there's pinned buffers). How do we "undo" the change? We can't
> really undo the ALTER SYSTEM, that's already written in the .conf
> and we don't know the old value, IIRC. Is it reasonable to start
> killing backends from the assign_hook or something? Seems weird.
>
The buffer pool resizing is done using the function
pg_resize_shared_buffers(), which does not need rolling back the
config value. As mentioned earlier, appropriate error handling is
still a TBD.
>
> v5-0007-Allow-to-resize-shared-memory-without-restart.patch
>
> 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug
> somewhere in the resizing?
>
It has been replaced by BufferManagerShmemResize() in the latest
patches. BufferManagerShmemResize() only resizes the buffer manager
related shared memory segments and data structures.
> 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird?
> Why is it needed, exactly? If it's to flush stuff for processes
> consuming EmitProcSignalBarrier, it's that too late? What if a
> process consumes the barrier between the emit and memory barrier?
>
> 3) WaitOnShmemBarrier seem a bit under-documented.
>
Both of these functions are not required in the new implementation.
Removed from the latest patches.
> 4) Is this actually adding buffers to the freelist? I see buf_init only
>    links the new buffers by setting freeNext, but where are the new
> buffers added to the existing freelist?
>
Freelist does not exist anymore.
> 5) The issue with a new backend seeing an old NBuffers value reminds me
> of the "support enabling checksums online" thread, where we ran into
> similar race conditions. See message [1], the part about race #2
> (the other race might be relevant too, not sure). It's been a while,
>    but I think our conclusion in that thread was that the "best" fix
> would be to change the order of steps in InitPostgres(), i.e. setup
> the ProcSignal stuff first, and only then "copy" the NBuffers value.
> And handle the possibility that we receive a "duplicate" barriers.
>
I plan to remove NBuffers entirely and instead use shared memory
variables, so that we don't have to worry about backends inheriting
stale values from the postmaster, or even involve the postmaster in
the resizing process. This is work in progress and will appear in a
future patchset.
> 6) In fact, the online checksums thread seems like a possible source of
> inspiration for some of the issues, because it needs to do similar
> stuff (e.g. make sure all backends follow steps in a synchronized
> way, etc.). And it didn't need new types of Barrier to do that.
>
Right. New implementation does not require any changes to the barrier
implementation.
> 7) Also, this seems like a perfect match for testing using injection
> points. In fact, there's not a single test in the whole patch series.
> Or a single line of .sgml docs, for that matter. It took me a while
> to realize I'm supposed to change the size by ALTER SYSTEM + reload
> the config.
>
There are some tests in the latest patches. More tests are being added.
>
> v5-0008-Support-shrinking-shared-buffers.patch
>
> 1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't
> there a place starting it and waiting for it to complete? Why
> shouldn't it do EvictExtraBuffers itself?
>
evictor_pid is not used in the latest patches. Eviction is done in
pg_resize_shared_buffers() itself by the backend which executes that
function.
> 2) Isn't the change to BufferManagerShmemInit wrong? How do we know the
> last buffer is still at the end of the freelist? Seems unlikely.
>
No freelist anymore.
> 3) Seems a bit strange to do it from a random backend. Shouldn't it
> be the responsibility of a process like checkpointer/bgwriter, or
> maybe a dedicated dynamic bgworker? Can we even rely on a backend
> to be available?
>
The backend which executes pg_resize_shared_buffers() coordinates the
resizing itself. I have not seen a need for it to use another
background worker yet.
> 4) Unsolved issues with buffers pinned for a long time. Could be an
> issue if the buffer is pinned indefinitely (e.g. cursor in idle
> connection), and the resizing blocks some activity (new connections
> or stuff like that).
>
Resizing stops immediately when it encounters a pinned buffer. More
sophisticated handling of such situations is TBD, and mostly v2
material.
> 5) Funny that "AI suggests" something, but doesn't the block fail to
> reset nextVictimBuffer of the clocksweep? It may point to a buffer
> we're removing, and it'll be invalid, no?
>
TODO:
> 6) It's not clear to me in what situations this triggers (in the call
> to BufferManagerShmemInit)
>
> if (FirstBufferToInit < NBuffers) ...
>
That code does not exist in the latest patches.
>
> v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
>
> 1) IMHO this should be included in the earlier resize/shrink patches,
> I don't see a reason to keep it separate (assuming this is the
> correct way, and the "init" is not).
>
Right. Merged into the resizing patch in the latest patches.
> 2) Doesn't StrategyPurgeFreeList already do some of this for the case
> of shrinking memory?
No freelist anymore.
>
> 3) Not great adding a bunch of static variables to bufmgr.c. Why do we
> need to make "everything" static global? Isn't it enough to make
> only the "valid" flag global? The rest can stay local, no?
>
> If everything needs to be global for some reason, could we at least
> make it a struct, to group the fields, not just separate random
>    variables? And maybe at the top, not half-way through the file?
>
> 4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting
> anything, it's just invalidating the info about past runs.
This is being worked on.
>
> 5) I don't quite understand why BufferSync needs to do the dance with
> delay_shmem_resize. I mean, we certainly should not run BufferSync
> from the code that resizes buffers, right? Certainly not after the
> eviction, from the part that actually rebuilds shmem structs etc.
> So perhaps something could trigger resize while we're running the
> BufferSync()? Isn't that a bit strange? If this flag is needed, it
> seems more like a band-aid for some issue in the architecture.
>
> 6) Also, why should it be fine to get into situation that some of the
> buffers might not be valid, during shrinking? I mean, why should
> this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers).
> It seems better to ensure we never get into "sync" in a way that
> might lead some of the buffers invalid. Seems way too lowlevel to
> care about whether resize is happening.
>
> 7) I don't understand the new condition for "Execute the LRU scan".
> Won't this stop LRU scan even in cases when we want it to happen?
> Don't we want to scan the buffers in the remaining part (after
> shrinking), for example? Also, we already checked this shmem flag at
> the beginning of the function - sure, it could change (if some other
> process modifies it), but does that make sense? Wouldn't it cause
> problems if it can change at an arbitrary point while running the
> BufferSync? IMHO just another sign it may not make sense to allow
> this, i.e. buffer sync should not run during the "actual" resize.
>
Resizing runs for a few seconds. With the default BufferSync frequency
of 200ms, there will be a few cycles of BufferSync during resizing. If
we don't let BufferSync run during resizing, we may have more dirty
buffers to deal with post-resizing. I think it is wise to let it run
during resizing on the portion of the buffer pool which is not being
removed. Working on that.
>
> v5-0010-Additional-validation-for-buffer-in-the-ring.patch
>
> 1) So the problem is we might create a ring before shrinking shared
> buffers, and then GetBufferFromRing will see bogus buffers? OK, but
> we should be more careful with these checks, otherwise we'll miss
> real issues when we incorrectly get an invalid buffer. Can't the
> backends do this only when they for sure know we did shrink the
> shared buffers? Or maybe even handle that during the barrier?
>
AFAIK, the buffer rings are inside the Scan nodes, which are not
accessible globally or to the barrier-handling functions, so we can't
purge the rings just once during or after resizing. If we already have
a mechanism to access the rings globally (and hence in the barrier
code), or we can develop such a mechanism, we could purge the rings
only once. Please let me know if you have any ideas on this.
> 2) IMHO a sign there's the "transitions" between different NBuffers
> values may not be clear enough, and we're allowing stuff to happen
> in the "blurry" area. I think that's likely to cause bugs (it did
> cause issues for the online checksums patch, I think).
>
NBuffers is going to be replaced by shared memory variables. The
current code relies on the buffer pool size being static for the
entire lifetime of the server; we are now changing that assumption.
So, yes, there will be some "blurry" areas and bugs to tackle. We are
developing tests to uncover and fix them.
Here's a reply to the email by Peter E [3].
On Tue, Nov 19, 2024 at 6:27 PM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>
> On 18.10.24 21:21, Dmitry Dolgov wrote:
> > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
> >
> > Preparation, introduces the possibility to work with many shmem mappings. To
> > make it less invasive, I've duplicated the shmem API to extend it with the
> > shmem_slot argument, while redirecting the original API to it. There are
> > probably better ways of doing that, I'm open for suggestions.
>
> After studying this a bit, I tend to think you should just change the
> existing APIs in place. So for example,
>
> void *ShmemAlloc(Size size);
>
> becomes
>
> void *ShmemAlloc(int shmem_slot, Size size);
>
> There aren't that many callers, and all these duplicated interfaces
> almost add more new code than they save.
>
> It might be worth making exceptions for interfaces that are likely to be
> used by extensions. For example, I see pg_stat_statements using
> ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are
> there any other examples out there? Maybe there are many more that I
> don't see right now. But at least for the initialization functions, it
> doesn't seem worth it to preserve the existing interfaces exactly.
>
> In any case, I think the slot number should be the first argument. This
> matches how MemoryContextAlloc() or also talloc() work.
>
Fixed. Please take a look at shmem.c changes.
> (Now here is an idea: Could these just be memory contexts? Instead of
> making six shared memory slots, could you make six memory contexts with
> a special shared memory type. And ShmemAlloc becomes the allocation
> function, etc.?)
I don't see a need to do that right now, but I will revisit this idea.
>
> I noticed the existing code made inconsistent use of PGShmemHeader * vs.
> void *, which also bled into your patch. I made the attached little
> patch to clean that up a bit.
>
The latest patches are rebased on top of your commit.
> I suggest splitting the struct ShmemSegment into one struct for the
> three memory addresses and a separate array just for the slock_t's. The
> former struct can then stay private in storage/ipc/shmem.c, only the
> locks need to be exported.
> Also, maybe some of this should be declared in storage/shmem.h rather
> than in storage/pg_shmem.h. We have the existing ShmemLock in there, so
> it would be a bit confusing to have the per-segment locks elsewhere.
>
I have described the new structures above. Please review.
>
> Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
Done.
>
> > v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
> >
> > Splits shared_buffers into multiple slots, moving out structures that depend on
> > NBuffers into separate mappings. There are two large gaps here:
> >
> > * Shmem size calculation for those mappings is not correct yet, it includes too
> > many other things (no particular issues here, just haven't had time).
> > * It makes hardcoded assumptions about what is the upper limit for resizing,
> > which is currently low purely for experiments. Ideally there should be a new
> > configuration option to specify the total available memory, which would be a
> > base for subsequent calculations.
>
> Yes, I imagine a shared_buffers_hard_limit setting. We could maybe
> default that to the total available memory, but it would also be good to
> be able to specify it directly, for testing.
>
Latest patches introduce max_shared_buffers instead.
>
> > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
> >
> > Allows an anonyous file to back a shared mapping. This makes certain things
> > easier, e.g. mappings visual representation, and gives an fd for possible
> > future customizations.
>
> I think this could be a useful patch just by itself, without the rest of
> the series, because of
>
> > * By default, Linux will not add file-backed shared mappings into a
> > core dump, making it more convenient to work with them in PostgreSQL:
> > no more huge dumps to process.
>
> This could be significant operational benefit.
>
> When you say "by default", is this adjustable? Does someone actually
> want the whole shared memory in their core file? (If it's adjustable,
> is it also adjustable for anonymous mappings?)
'man core' says that the /proc/[PID]/coredump_filter file controls
which memory segments are written to the core file. The value in the
file is a bitmask of memory mapping types; the bits correspond to the
flags passed to mmap(). A program can write the file through
/proc/self/coredump_filter, or it can be set externally by something
like echo > /proc/PID/coredump_filter. A boot-time setting becomes the
default for all programs. A child process inherits the setting from
its parent through fork(), and execve() does not change it. So it
looks like DBAs should be able to control what gets dumped in
PostgreSQL core dumps using any of those options. If we use
file-backed shared memory as the patches do, DBAs may have to change
their default setting, which may not have accounted for file-backed
shared memory. We will need to highlight this in the release notes; I
have added a TODO for it in the code for now, and will add the note to
the commit message or update the relevant documentation as the patches
become committable.
The man page says that bits 0, 1 and 4 are set by default (anonymous
private and shared mappings, and ELF headers), but on my machine the
default is different. It's possible that different installations use
different configurations.
>
> I'm wondering about this change:
>
> -#define PG_MMAP_FLAGS
> (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
> +#define PG_MMAP_FLAGS (MAP_SHARED|MAP_HASSEMAPHORE)
>
> It looks like this would affect all mmap() calls, not only the one
> you're changing. But that's the only one that uses this macro! I don't
> understand why we need this; I don't see anything in the commit log
> about this ever being used for any portability. I think we should just
> get rid of it and have mmap() use the right flags directly.
This is committed as c100340729b66dc46d4f9d68a794957bf2c468d8.
>
> I see that FreeBSD has a memfd_create() function. Might be worth a try.
> Obviously, this whole thing needs a configure test for memfd_create()
> anyway.
We are using memfd_create() in the latest patches. I will look into
creating a configure test for memfd_create(); I have added a TODO
comment for it.
>
> I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear
> how that interacts with the MAP_HUGETLB flag for mmap(). Do you need to
> specify both of them if you want huge pages?
We need both. I used the toy program attached to [0] to verify that.
Maybe we could use a similar program for the configure test.
There are three important pieces of the puzzle that I will work on next:
1. Synchronization: the coordinator sends ProcSignal barriers to all
concurrent backends during resizing. Code that deals with the buffer
pool needs to react to these barriers and may need to adjust its
course of action. The checkpointer, the background writer, new buffer
allocation, and code scanning the buffer pool sequentially are a few
examples that need to react to the barrier.
2. Graceful handling of errors and failures during resizing.
3. Portability: a configure test to check whether a platform has the
APIs needed to support this feature, enabling the feature in the
corresponding build. On other platforms, the build succeeds and
regression tests run, but the feature is disabled.
All the TODOs mentioned above, and those not covered by the three
points, have been added to the code. One of them is to reduce the
number of segments to just two (main, and shared buffer blocks). I
plan to address them as the feature matures.
[0] https://www.postgresql.org/message-id/CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J%2BeQ%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CAExHW5sOu8%2B9h6t7jsA5jVcQ--N-LCtjkPnCw%2BrpoN0ovT6PHg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u%40yaogvphfwc4h
[3] https://www.postgresql.org/message-id/12add41a-7625-4639-a394-a5563e349322%40eisentraut.org
[4] https://www.postgresql.org/message-id/CAExHW5vTWABxuM5fbQcFkGuTLwaxuZDEE2vtx2WuMUWk6JnF4g%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAEze2WiMkmXUWg10y%2B_oGhJzXirZbYHB5bw0%3DVWte%2BYHwSBa%3DA%40mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
| Attachment | Content-Type | Size |
|---|---|---|
| v20260128-0001-Add-a-view-to-read-contents-of-shared-buff.patch | text/x-patch | 14.8 KB |
| v20260128-0003-Allow-to-resize-shared-memory-without-rest.patch | text/x-patch | 151.3 KB |
| v20260128-0002-Memory-and-address-space-management-for-bu.patch | text/x-patch | 101.6 KB |