From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, chaturvedipalak1911(at)gmail(dot)com
Subject: Re: Changing shared_buffers without restart
Date: 2025-10-13 15:58:09
Message-ID: CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
Lists: pgsql-hackers
Hi,
I started studying the interaction of the checkpointer process with
buffer pool resizing. I soon noticed that the checkpointer doesn't
reload the config as frequently as other backends; for example, while
executing a checkpoint, it does not reload the config for the entire
duration of the checkpoint. With the synchronization as implemented in
the patchset so far, the checkpointer will not see the new value of
shared_buffers, will not acknowledge the proc signal barrier, and thus
will not enter the synchronized buffer resizing. However, other
backends will notice that the checkpointer has received the proc
signal barrier and will enter the synchronization process. Once the
proc signal barrier has been received by all the backends, the
backends which have entered the synchronization process will move
forward with resizing the buffer pool, leaving behind those which have
received but not acknowledged the proc signal barrier. At the end
there are two sets of backends: one which has entered synchronization
and sees the buffer pool with the new size, and another which hasn't
entered synchronization and does not see the buffer pool with the new
size. This leads to SIGBUS or SIGSEGV (signal 11) in the latter set of
backends. I saw this mostly with the checkpointer process, but we also
saw it with other types of backends.
Every aspect of buffer resizing that I started looking at was blocked
by this behaviour. Since there were already other suggestions and
comments about the current UI as well as the synchronization
mechanism, I started implementing a different UI and synchronization,
described below. The WIP implementation is available in the attached
set of patches.
Patches 0001 to 0016 are the same as in the previous patchset. I
haven't touched them, in case someone would like to see the
incremental changes. However, the series is getting unwieldy at this
point, so I will squash related patches together and provide a
patchset with fewer patches next time.

0017 reverts 0003 and gets rid of the "pending" GUC flag, which is not
required by the new UI. Both will vanish from the next patchset.
0018 implements the new UI described below.
New UI and synchronization
======================
0018 changes the way "shared_buffers" is handled.
a. A new global variable NBuffersPending holds the value of this GUC.
When the server starts, the shared memory required by the buffer
manager is calculated using NBuffersPending instead of NBuffers. Once
the shared memory is allocated, NBuffers is set to NBuffersPending.
NBuffers thus shows the number of buffers actually in the buffer pool
rather than the value of the GUC.
b. "shared_buffers" is PGC_SIGHUP now so it can be changed using ALTER
SYSTEM ... SET shared_buffers = ...; followed by SELECT
pg_reload_config(). But this does not resize the buffer pool. It
merely sets NBuffersPending to the new value. A new function
pg_resize_buffer_pool() (described later) can be used to resize the
buffer pool to the pending value.
c. show "shared_buffers" shows the value of NBuffers, and
NBuffersPending if it differs from NBuffers. I think we need some
adjustment here when the resizing is in progress since the value of
NBuffers would be changed to the size of the active buffer pool
(explained later in the email), but I haven't worked out those details
yet.
A new GUC max_shared_buffers sets the upper limit on "shared_buffers".
It is PGC_POSTMASTER and requires a restart to change. This GUC is
used (a) to reserve the address space for future expansion of the
buffer pool and (b) to allocate memory for a maximally sized buffer
lookup table at server start. We may decide to use the GUC to
maximally allocate data structures other than buffer blocks, as
suggested by Andres, but these patches don't do that. The default for
this GUC is 0, which means it will be the same as shared_buffers. This
maintains backward compatibility and also allows systems which do not
want to resize the shared buffer pool to allocate the minimum memory.
When set to a value other than 0, it should be set higher than
shared_buffers at server start.
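
As an illustration of (a) above and of the address space reservation,
here is a minimal standalone sketch. It is not the patch's code: the
patch uses an anonymous file with ftruncate + mprotect, while this
sketch uses a plain anonymous mapping, and all sizes are made up.

/*
 * Standalone sketch: reserve address space for max_shared_buffers,
 * commit only the shared_buffers portion.  Linux/POSIX mmap assumed;
 * buffer counts are illustrative.
 */
#include <stdio.h>
#include <sys/mman.h>

#define BLCKSZ 8192

int
main(void)
{
    size_t  max_buffers = 131072;   /* max_shared_buffers: 1GB of 8kB blocks */
    size_t  cur_buffers = 16384;    /* shared_buffers: 128MB */
    char   *base;

    /* Reserve the whole range; no memory is committed yet. */
    base = mmap(NULL, max_buffers * BLCKSZ, PROT_NONE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Make only the active portion usable; this is NBuffersPending. */
    if (mprotect(base, cur_buffers * BLCKSZ, PROT_READ | PROT_WRITE) != 0)
        return 1;

    /* Only now would NBuffers be set to NBuffersPending.  Expanding
     * later is another mprotect over a larger prefix; any access
     * beyond the active portion faults immediately. */
    printf("reserved %zu MB, usable %zu MB at %p\n",
           max_buffers * BLCKSZ >> 20, cur_buffers * BLCKSZ >> 20,
           (void *) base);
    return 0;
}
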
We need to keep supporting ALTER SYSTEM ... SET shared_buffers = ...
for backward compatibility: users will still be able to perform ALTER
SYSTEM and restart the server with a new buffer pool size. It also
allows the new buffer pool size to be written to postgresql.auto.conf
and persisted. With this, we can simply use pg_reload_conf() to load
the new value along with other GUC changes. pg_resize_buffer_pool()
merely picks up the new value in the backend where it is executed and
resizes the buffer pool; it does not need the new value to be loaded
in all the backends.
We may want a new PGC_ context for this GUC, but PGC_SIGHUP suffices
for the time being and might be acceptable with clear documentation.
pg_resize_buffer_pool() implements a phase-wise buffer pool resizing
operation, but it does not block all the backends till the buffer pool
resizing is finished. It works as follows (pasting from the prologue
in patch 0018):
When resizing, the buffer pool is divided into two portions:
- the active buffer pool, which is the part of the buffer pool that
remains active even during resizing. Its size is given by
activeNBuffers. Newly allocated buffers will have buffer ids less than
activeNBuffers.
- the in-transit buffer pool, which is the part of the buffer pool
that may be accessible to some backends but not others, depending upon
the time when a given backend processes a shrink/expand barrier. When
shrinking the buffer pool, this is the part of the buffer pool which
will be evicted. When expanding the buffer pool, this is the expanded
portion. Its size is given by transitNBuffers. The backends may see
buffer ids up to transitNBuffers until the resizing finishes.
Before resizing starts, activeNBuffers = transitNBuffers = NBuffers,
where NBuffers is the size of the buffer pool before resizing.
NewNBuffers is the new size of the shared buffer pool. After resizing
finishes, activeNBuffers = transitNBuffers = NBuffers = NewNBuffers.
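
As a concrete illustration of how these two sizes bound buffer use,
here is a standalone sketch of a shrink from 1024 to 512 buffers. The
helper names are invented for this example; it is not the patch's
code, and the barriers it mentions are described just below.

/*
 * Illustrative sketch of activeNBuffers/transitNBuffers bounds during
 * a shrink.  Not patch code; helper names are invented.
 */
#include <assert.h>
#include <stdbool.h>

static int activeNBuffers;   /* new allocations come from [0, activeNBuffers) */
static int transitNBuffers;  /* ids below this may still be seen until resizing ends */

/* A backend hunting for a victim buffer stays within the active pool. */
static int
next_victim_buffer(int clock_hand)
{
    return (clock_hand + 1) % activeNBuffers;
}

/* Ids in [activeNBuffers, transitNBuffers) belong to the in-transit
 * portion and disappear once resizing finishes. */
static bool
buffer_id_is_valid(int buf_id)
{
    return buf_id >= 0 && buf_id < transitNBuffers;
}

int
main(void)
{
    /* Before resizing: both bounds sit at the old NBuffers. */
    activeNBuffers = transitNBuffers = 1024;

    /* After absorbing the shrink barrier (SHBUF_SHRINK, see below):
     * allocate only below 512, but ids up to 1024 may still be seen
     * until eviction completes. */
    activeNBuffers = 512;
    assert(buffer_id_is_valid(800));
    assert(next_victim_buffer(510) < activeNBuffers);

    /* After SHBUF_RESIZE_MAP_AND_MEM: the old range is gone. */
    transitNBuffers = 512;
    assert(!buffer_id_is_valid(800));
    return 0;
}
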
In order to synchronize with the other running backends, the
coordinator sends the following ProcSignalBarriers in the order given
below:
1. When shrinking the shared buffer pool, the coordinator sends the
SHBUF_SHRINK ProcSignalBarrier. Every backend sets activeNBuffers =
NewNBuffers to restrict its buffer allocations to the new size of the
buffer pool and acknowledges the ProcSignalBarrier. Once every backend
has acknowledged, the coordinator evicts the buffers in the area being
shrunk. Note that transitNBuffers is still NBuffers, so the backends
may see buffer ids up to NBuffers from earlier allocations till
eviction completes.
2. In both cases, whether expanding or shrinking the buffer pool, the
coordinator sends the SHBUF_RESIZE_MAP_AND_MEM ProcSignalBarrier after
resizing the shared memory segments and initializing the required data
structures, if any. Every backend is expected to adjust its shared
memory segment address maps (by calling AnonymousShmemResize()) and
validate that its pointers to the shared buffers structure are valid
and have the right size. When shrinking the shared buffer pool,
transitNBuffers is set to NewNBuffers and the backends should no
longer see buffer ids beyond NewNBuffers; the buffer resizing
operation is finished at this stage. When expanding, the backends
should set transitNBuffers to NewNBuffers to accommodate backends
which may process the next barrier earlier than others. Once every
backend acknowledges this barrier, the coordinator sends the next
barrier when expanding the buffer pool.
3. When expanding the buffer pool, the coordinator sends the
SHBUF_EXPAND ProcSignalBarrier. The backends are expected to set
activeNBuffers = NewNBuffers and start allocating buffers from the
expanded range. The coordinator uses this barrier to know when all the
backends have settled on the new size of the buffer pool.
For either operation, at most two barriers are sent.
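
The sequence above, seen from the coordinator's side, looks roughly
like the following compilable outline. The barrier machinery is
stubbed out: EmitBarrier/WaitForBarrier stand in for
EmitProcSignalBarrier/WaitForProcSignalBarrier, and the other helpers
are invented for this sketch; it is not the patch's actual code.

/*
 * Coordinator-side outline of pg_resize_buffer_pool(); a sketch only.
 */
typedef enum
{
    SHBUF_SHRINK,               /* backends: activeNBuffers = NewNBuffers */
    SHBUF_RESIZE_MAP_AND_MEM,   /* backends: remap segments, set transitNBuffers */
    SHBUF_EXPAND                /* backends: activeNBuffers = NewNBuffers */
} ShBufBarrier;

static int NBuffers, activeNBuffers, transitNBuffers;

static void EmitBarrier(ShBufBarrier b) { (void) b; /* broadcast to all backends */ }
static void WaitForBarrier(void) { /* wait for every backend to acknowledge */ }
static void EvictBuffers(int from, int to) { (void) from; (void) to; /* flush ids in [from, to) */ }
static void ResizeSegmentsAndMaps(int nbuffers) { (void) nbuffers; /* ftruncate + mprotect */ }

static void
resize_buffer_pool(int NewNBuffers)
{
    if (NewNBuffers < NBuffers)
    {
        /* 1. Stop everyone from allocating beyond the new size... */
        EmitBarrier(SHBUF_SHRINK);
        WaitForBarrier();
        /* ...then evict whatever lives in the doomed range. */
        EvictBuffers(NewNBuffers, NBuffers);
    }

    /* 2. Resize the memory itself, then have every backend refresh
     * its mappings and set transitNBuffers = NewNBuffers. */
    ResizeSegmentsAndMaps(NewNBuffers);
    EmitBarrier(SHBUF_RESIZE_MAP_AND_MEM);
    WaitForBarrier();

    if (NewNBuffers > NBuffers)
    {
        /* 3. Only now may backends allocate from the expanded range. */
        EmitBarrier(SHBUF_EXPAND);
        WaitForBarrier();
    }

    NBuffers = activeNBuffers = transitNBuffers = NewNBuffers;
}

int
main(void)
{
    NBuffers = activeNBuffers = transitNBuffers = 16384;
    resize_buffer_pool(8192);   /* shrink: two barriers */
    resize_buffer_pool(32768);  /* expand: two barriers */
    return 0;
}
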
All of this together in action looks like the following (see the tests
in the patch for more examples):
SHOW shared_buffers; -- default
shared_buffers
----------------
128MB
(1 row)
ALTER SYSTEM SET shared_buffers = '64MB';
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
SHOW shared_buffers;
shared_buffers
-----------------------
128MB (pending: 64MB)
(1 row)
SELECT pg_resize_shared_buffers();
pg_resize_shared_buffers
--------------------------
t
(1 row)
SHOW shared_buffers;
shared_buffers
----------------
64MB
(1 row)
ALTER SYSTEM SET shared_buffers = '256MB';
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
SHOW shared_buffers;
shared_buffers
-----------------------
64MB (pending: 256MB)
(1 row)
SELECT pg_resize_shared_buffers();
pg_resize_shared_buffers
--------------------------
t
(1 row)
SHOW shared_buffers;
shared_buffers
----------------
256MB
(1 row)
On Thu, Sep 18, 2025 at 7:22 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Sun, 6 Apr 2025 16:40:32 +0200
> > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
> >
> > Currently an assign hook can perform some preprocessing of a new value,
> > but it cannot change the behavior, which dictates that the new value
> > will be applied immediately after the hook. Certain GUC options (like
> > shared_buffers, coming in subsequent patches) may need coordination
> > between backends to change, meaning we cannot apply them right away.
> >
> > Add a new flag "pending" to allow an assign hook to indicate exactly
> > that. If the pending flag is set after the hook, the new value will
> > not be applied and its handling becomes the hook's implementation
> > responsibility.
>
> I doubt it makes sense to add this to the GUC system. I think it'd be better
> to just use the GUC value as the desired "target" configuration and have a
> function or a show-only GUC for reporting the current size.
This has been taken care of in the new implementation, with a slightly
different approach to the SHOW command, as described above.
>
> I don't think you can just block application of the GUC until the resize is
> complete. E.g. what if the value was too big and the new configuration needs
> to be fixed to be lower?
>
With the above approach, application of the GUC won't be blocked, but
if the resize being applied takes too long, the operation will have to
be cancelled before a new resize can happen. That's a part that needs
some work. Chasing a moving target requires a very complex
implementation, which would be good to avoid in the first version at
least. However, we should leave room for that future enhancement. The
current implementation gives that flexibility, I think.
>
> > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Fri, 4 Apr 2025 21:46:14 +0200
> > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
> >
> > Currently WaitForProcSignalBarrier allows one to make sure the message sent
> > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
> > participants.
> >
> > Add pss_barrierReceivedGeneration alongside pss_barrierGeneration,
> > which will be updated when a process has received the message, but not
> > processed it yet. This makes it possible to support a new mode of
> > waiting, when ProcSignal participants want to synchronize message
> > processing. To do that, a participant can wait via
> > WaitForProcSignalBarrierReceived when processing a message, effectively
> > making sure that all processes are going to start processing
> > ProcSignalBarrier simultaneously.
>
> I doubt "online resizing" that requires synchronously processing the same
> event, can really be called "online". There can be significant delays in
> processing a barrier, stalling the entire server until that is reached seems
> like a complete no-go for production systems?
>
> > From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Tue, 17 Jun 2025 14:16:55 +0200
> > Subject: [PATCH 11/16] Allow to resize shared memory without restart
> >
> > Add an assign hook for shared_buffers to resize shared memory using the
> > space introduced in the previous commits, without requiring a PostgreSQL restart.
> > Essentially the implementation is based on two mechanisms: a
> > ProcSignalBarrier is used to make sure all processes are starting the
> > resize procedure simultaneously, and a global Barrier is used to
> > coordinate after that and make sure all finished processes are waiting
> > for others that are in progress.
> >
> > The resize process looks like this:
> >
> > * The GUC assign hook sets a flag to let the Postmaster know that resize
> > was requested.
> >
> > * Postmaster verifies the flag in the event loop, and starts the resize
> > by emitting a ProcSignal barrier.
> >
> > * All processes that participate in the ProcSignal mechanism begin to
> > process the ProcSignal barrier. First a process waits until all processes
> > have confirmed they received the message and can start simultaneously.
>
> As mentioned above, this basically makes the entire feature not really
> online. Besides the latency of some processes not getting to the barrier
> immediately, there's also the issue that actually reserving large amounts of
> memory can take a long time - during which all processes would be unavailable.
>
> I really don't see that being viable. It'd be one thing if that were a
> "temporary" restriction, but the whole design seems to be fairly centered
> around that.
In the new implementation, regular backends are not stalled while
resizing is going on. They continue their work, with possible
temporary performance degradation (this needs to be measured).
>
> > From experiment it turns out that shared mappings have to be extended
> > separately for each process that uses them. Another rough edge is that a
> > backend blocked on ReadCommand will not apply shared_buffers change
> > until it receives something.
>
> That's not a rough edge, that basically makes the feature unusable, no?
The new synchronization doesn't have this problem, since it doesn't
require every backend to load the new value. It is enough for the
value to be loaded in the backend where pg_resize_buffer_pool() is
run.
>
> > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Tue, 17 Jun 2025 11:22:02 +0200
> > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
> >
> > Add more shmem segments to split shared buffers into the following chunks:
> > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
> > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
> > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
> > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
> > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
>
> Why do all these need to be separate segments? Afaict we'll have to maximally
> size everything other than BUFFERS_SHMEM_SEGMENT at start?
>
I am leaning towards that. I will implement that soon.
On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
>
> I see you folks are inclined to keep some small segments static and
> allocate maximum allowed memory for it. It's an option, at the end of
> the day we need to experiment and measure both approaches.
I did measure performance with a maximally sized buffer lookup table
(shared_buffers = 128MB, max_shared_buffers = 10GB) on my laptop.
There was no noticeable difference in performance. I will post formal
numbers with the next patchset.
>
>
> > * Every process recalculates shared memory size based on the new
> > NBuffers, adjusts its size using ftruncate and adjusts reservation
> > permissions with mprotect. One elected process signals the postmaster
> > to do the same.
>
> If we just used a single memory mapping with all unused parts marked
> MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
> work in this patchset).
>
On Sat, Sep 27, 2025 at 12:06 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > How do we return memory to the OS in that case? Currently it's done
> > explicitly via truncating the anonymous file.
>
> madvise with MADV_DONTNEED or MADV_REMOVE.
The patchset still uses ftruncate + mprotect. Apart from portability
concerns, I have questions about your proposal. The MADV_DONTNEED
documentation says:
    After a successful MADV_DONTNEED operation, the semantics of
    memory access in the specified region are changed: subsequent
    accesses of pages in the range will succeed, but will result in
    either repopulating the memory contents from the up-to-date
    contents of the underlying mapped file (for shared file mappings,
    shared anonymous mappings, and shmem-based techniques such as
    System V shared memory segments) or zero-fill-on-demand pages for
    anonymous private mappings.

    Note that, when applied to shared mappings, MADV_DONTNEED might
    not lead to immediate freeing of the pages in the range. The
    kernel is free to delay freeing the pages until an appropriate
    moment. The resident set size (RSS) of the calling process will
    be immediately reduced however.

    MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages,
    or VM_PFNMAP pages. (Pages marked with the kernel-internal
    VM_PFNMAP flag are special memory areas that are not managed by
    the virtual memory subsystem. Such pages are typically created by
    device drivers that map the pages into user space.)

and the MADV_REMOVE documentation (since Linux 2.6.16) says:

    Free up a given range of pages and its associated backing store.
    This is equivalent to punching a hole in the corresponding byte
    range of the backing store (see fallocate(2)). Subsequent
    accesses in the specified address range will see bytes containing
    zero.

    The specified address range must be mapped shared and writable.
    This flag cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages.
Combining these two:

1. Access to the freed memory doesn't give any error but returns
zeros. Won't that lead to silent corruption?

2. Neither flag is supported for huge TLB pages, so they cannot be
used when huge_pages = on?

With the current approach, we get SIGBUS or SIGSEGV when a process
tries to access the freed memory. That protection won't be there with
madvise().
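
The silent-read behaviour in question 1 is easy to demonstrate with a
small standalone Linux program (a sketch for illustration, not patch
code):

/*
 * After MADV_REMOVE, the freed range reads back as zeroes instead of
 * faulting, so a stale pointer dereference goes unnoticed.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
    size_t len = 1 << 20;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return 1;

    memset(p, 0xAB, len);

    /* Punch out the second half, as one would when shrinking. */
    if (madvise(p + len / 2, len / 2, MADV_REMOVE) != 0)
        return 1;

    /* No error, no signal: the "freed" page silently reads as 0. */
    printf("byte after free: %d\n", p[len / 2]);

    /* The current mprotect approach turns the same access into a fault. */
    mprotect(p + len / 2, len / 2, PROT_NONE);
    /* p[len / 2] would now crash the process with SIGSEGV. */
    return 0;
}
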
The synchronization mechanism in this patch is inspired by Thomas's
implementation posted in [1].
I still need to go through Tomas's detailed comments and address those
which still apply. The patches are also still WIP, with many TODOs.
But I wanted to get some feedback on the proposed UI and
synchronization described above.
I will be looking into the following cases one by one:
1. New backends joining while synchronization is going on; an existing
backend exiting.
2. A failure or crash in the backend which is executing
pg_resize_buffer_pool().
3. Fixing crashes in the tests.
[1] postgr.es/m/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat