Re: Changing shared_buffers without restart

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, chaturvedipalak1911(at)gmail(dot)com
Subject: Re: Changing shared_buffers without restart
Date: 2025-11-14 11:53:21
Message-ID: CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J+eQ@mail.gmail.com
Lists: pgsql-hackers

Hi,
PFA new patchset with some TODOs from previous email addressed:

On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> 1. New backends join while the synchronization is going on.

Done. Explained the solution below in detail.

> An existing backend exiting.

Not tested specifically, but should work.

> 2. Failure or crash in the backend which is executing pg_resize_buffer_pool()

Still a TODO.

> 3. Fix crashes in the tests.

The core regression tests and the pg_buffercache regression tests pass,
and the tests for buffer resizing pass most of the time. So far I have
seen three issues:
1. An assertion failure from an AIO worker, which happened only once and
which I couldn't reproduce. I need to study the interaction of AIO
workers with buffer resizing.
2. Checkpointer crashes, which is one of the TODOs listed below.
3. A shared memory ID related failure, which I don't understand but
which happens more frequently than the first one. I need to look into
that.

> go through Tomas's detailed comments and address those
> which still apply.

Still a TODO. But since many of those patches have been revised
heavily, I think many of the comments may already have been addressed,
and some may not apply anymore.

> And the patches are still WIP, with many TODOs. But I wanted to get some feedback on the proposed UI and synchronization

This is still a request.

> Patches 0001 to 0016 are the same as the previous patchset. I haven't
> touched them in case someone would like to see an incremental change.
> However, it's getting unwieldy at this point, so I will squash
> relevant patches together and provide a patchset with fewer patches
> next.

I have squashed the patches into 3 so that it's easy to review, read
and work with those patches. The work is still WIP and there are many
TODOs in the patches.

Patch 0001: SQL interface to read contents of buffer lookup table. It
was there in the previous patchset as 0001 but in this patchset I have
moved the SQL function to the pg_buffercache module and renamed it
accordingly. I added this change because I found it useful to debug
issues I found while testing buffer resizing patches. The issues were
related to page->buffer mappings which existed in the buffer lookup
table but were not present in the buffer descriptor array or buffer
blocks. pg_buffercache, which traverses just the buffer descriptor
array, isn't enough to detect them. Even without the resizing
functionality this will help us catch situations where the buffer
descriptor array and the buffer lookup table go out of sync. I plan to
keep it in this patchset as a
debugging tool. If other developers feel that it could be useful, I
will propose it in a separate thread.
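
To make the class of bug concrete, here is a toy sketch of the
invariant such a debugging function helps check: every entry in the
lookup table should point at an in-range, valid descriptor carrying the
same page identity. All types and names here (PageTag, ToyBufferDesc,
ToyLookupEntry, count_stale_mappings) are hypothetical simplifications
for illustration, not the structures or the SQL function from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page identity; real buffer tags carry more. */
typedef struct PageTag
{
    unsigned    rel;
    unsigned    block;
} PageTag;

/* Toy stand-in for one slot of the buffer descriptor array. */
typedef struct ToyBufferDesc
{
    PageTag     tag;
    bool        valid;
} ToyBufferDesc;

/* Toy stand-in for one page->buffer entry of the lookup table. */
typedef struct ToyLookupEntry
{
    PageTag     tag;
    int         buf_id;
} ToyLookupEntry;

static bool
tag_equal(PageTag a, PageTag b)
{
    return a.rel == b.rel && a.block == b.block;
}

/*
 * Count lookup entries that do not match the descriptor array: the
 * buffer id is out of range (for example after a shrink), the
 * descriptor is not valid, or it holds a different page.
 */
static size_t
count_stale_mappings(const ToyLookupEntry *entries, size_t nentries,
                     const ToyBufferDesc *descs, int ndescs)
{
    size_t      stale = 0;

    for (size_t i = 0; i < nentries; i++)
    {
        int         id = entries[i].buf_id;

        if (id < 0 || id >= ndescs ||
            !descs[id].valid ||
            !tag_equal(descs[id].tag, entries[i].tag))
            stale++;
    }
    return stale;
}
```

A scan of the descriptor array alone cannot see such entries, because
it starts from the descriptors and never visits the mappings.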

Patch 0002: This is a single patch squashing all patches (0005, 0006,
0007, 0008, 0009 and 0010) related to shared memory management and
address space reservation together. This patch allows the creation of
multiple shared memory segments and also lays them out so as to make
those resizable. The actual code to resize the segments is in the next
patch. The APIs used for memory management and address space
reservation are described later. Prominent changes from the previous
patches are:
1. It modifies CalculateShmemSize() so that it can work with multiple
shared memory segments.
2. It combines the AnonymousMapping and ShmemSegment structures as
suggested by Tomas upthread. The merger is still in progress: there are
some old comments and variable names referring to memory mappings where
they should mention shared memory segments. I will work on that when I
start polishing this patch.
3. The GUC to specify the maximum size of the buffer pool has been
renamed and moved to the next patch, which deals with actual resizing.
4. Changes to process config reload in AIO workers are removed. Those
are not needed after 55b454d0e14084c841a034073abbf1a0ea937a45.

Patch 0003: Implements the UI and synchronization described in the
previous email [1] with additional improvements to support a new
backend joining while resizing is in progress. This patch squashes
patches 0002 - 0004 and 0011 onward from the previous patchset, but it
also gets rid of a lot of code related to the old synchronization
method and the old UI. The code related to resizing, including the
implementation of pg_resize_shared_buffers(), is moved to a new file,
storage/buffer/buf_resize.c. There is no change to the UI. Buffer
resizing still looks as described in the previous email.

> SHOW shared_buffers; -- default
> shared_buffers
> ----------------
> 128MB
> (1 row)
>
> ALTER SYSTEM SET shared_buffers = '64MB';
> SELECT pg_reload_conf();
> pg_reload_conf
> ----------------
> t
> (1 row)
>
> SHOW shared_buffers;
> shared_buffers
> -----------------------
> 128MB (pending: 64MB)
> (1 row)
>
> SELECT pg_resize_shared_buffers();
> pg_resize_shared_buffers
> --------------------------
> t
> (1 row)
>
> SHOW shared_buffers;
> shared_buffers
> ----------------
> 64MB
> (1 row)
>
> ALTER SYSTEM SET shared_buffers = '256MB';
> SELECT pg_reload_conf();
> pg_reload_conf
> ----------------
> t
> (1 row)
>
> SHOW shared_buffers;
> shared_buffers
> -----------------------
> 64MB (pending: 256MB)
> (1 row)
>
> SELECT pg_resize_shared_buffers();
> pg_resize_shared_buffers
> --------------------------
> t
> (1 row)
>
> SHOW shared_buffers;
> shared_buffers
> ----------------
> 256MB
> (1 row)
>

The implementation uses a strategy similar to the one described in the
previous email, with the changes described below.

A new backend inherits the address space of shared memory segments and
the process-local variable NBuffers through Postmaster. These are
changed when resizing the buffer pool, so the same changes would need
to be applied to the Postmaster for a new backend to inherit them.
Since Postmaster is not part of the ProcSignalBarrier mechanism, the
coordinator would have to signal the Postmaster separately. This has
the following drawbacks:
1. Additional code is needed to signal Postmaster.
2. The coordinator has to wait for Postmaster to apply the changes
separately, which adds extra delays.
3. Platforms which use fork() + exec() need more complex code to
transfer the state to a new child.
4. If the postmaster is signaled after sending a barrier to other
backends, a newly joined backend will miss the state update as well as
the barrier. If the postmaster is signaled before sending a barrier to
other backends, a newly joining backend will receive the barrier as
well as the state update from Postmaster. This means the barrier
handling code is required to be idempotent, which makes it more complex
and also more constrained.

Instead, the approach taken by Thomas Munro in [2] does not require
updating the address space. It uses shared memory variables instead of
process-local variables to save the state of the shared buffer pool.
This patchset uses a similar approach, which
1. avoids involving Postmaster in the resizing process, and
2. makes the barrier handling code very thin.

Shared Memory and address space management
==========================================
An fd is created using memfd_create() to manage the size of the shared
memory segment using ftruncate() and fallocate(). That fd is passed to
mmap(), which reserves the maximum required address space and maps the
anonymous file (and the backing memory) into that address space. mmap()
is called with MAP_NORESERVE so that memory is not allocated against
the whole mapping; the size of the anonymous file controls the amount
of memory allocated. For the main shared memory segment, the size of
the reserved space is the same as the amount of memory required. But
for shared buffer pool related segments the size of the reserved space
is decided by the GUC max_shared_buffers (mentioned in the previous
email and quoted below). When resizing shared buffers, only the
anonymous file is resized and not the address space. I tested this
protocol with the attached small program (mfdtruncate.c). Sharing it in
case somebody finds it useful.
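
For readers without the attachment at hand, the sequence of calls can
be sketched roughly as follows. This is a Linux-only illustration
written for this description and is neither mfdtruncate.c nor code from
the patches; ResizableSegment, segment_create() and segment_resize()
are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one resizable segment. */
typedef struct ResizableSegment
{
    int         fd;         /* anonymous file backing the segment */
    void       *base;       /* start of the reserved address range */
    size_t      reserved;   /* bytes of address space reserved */
    size_t      allocated;  /* current size of the anonymous file */
} ResizableSegment;

/* Reserve 'reserved' bytes of address space, with a file of 'initial' bytes. */
static int
segment_create(ResizableSegment *seg, size_t reserved, size_t initial)
{
    seg->fd = memfd_create("sketch_segment", 0);
    if (seg->fd < 0)
        return -1;
    if (ftruncate(seg->fd, (off_t) initial) != 0)
    {
        close(seg->fd);
        return -1;
    }
    /* Map the whole reserved range once; only the file size changes later. */
    seg->base = mmap(NULL, reserved, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_NORESERVE, seg->fd, 0);
    if (seg->base == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }
    seg->reserved = reserved;
    seg->allocated = initial;
    return 0;
}

/* Resize only the anonymous file; the mapping and its address stay put. */
static int
segment_resize(ResizableSegment *seg, size_t new_size)
{
    size_t      old_size = seg->allocated;

    if (new_size > seg->reserved)
        return -1;              /* cannot grow past the reservation */
    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;
    seg->allocated = new_size;
    /* When growing, ask for the backing memory up front. */
    if (new_size > old_size &&
        fallocate(seg->fd, 0, (off_t) old_size,
                  (off_t) (new_size - old_size)) != 0)
        return -1;
    return 0;
}
```

In this sketch, touching an address inside the reservation but beyond
the current file size raises SIGBUS, so callers must stay within
seg->allocated.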

Saving shared buffer pool sizes in the shared memory
====================================================
When resizing, we need to track two ranges of buffers:
1. active buffers, the range of buffers from which new allocations
happen at a given time, and
2. valid buffers, the range of buffers which are valid at a given time.
When shrinking, the active range is set to the new size while the valid
range remains the same as the old size until all the buffers outside
the new size are evicted. When expanding, the valid and active ranges
are both changed to the new size after memory is resized and the
expanded data structures are initialized. The current global variable
NBuffers is insufficient to track these two numbers.

Instead we have a new member StrategyControl::activeNBuffers which
tracks the active buffer range. The shared memory structure controlling
the resizing operation (ShmemCtrl) has a member currentNBuffers which
gives the valid range of shared buffers at a given point in time. (I am
planning to merge ShmemCtrl and StrategyControl, so that we have all
the metadata about shared buffers in one place in the shared memory).
These two numbers are saved in the shared memory for the reasons
explained above and replace the current NBuffers. They are modified by
the coordinator as the resizing progresses. Some usages of NBuffers
have been replaced by one of the two variables as appropriate, but more
work is required.
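
A toy model of how the two numbers move during a shrink and an
expansion, under the rules described above. The field names follow
activeNBuffers and currentNBuffers, but the PoolSizes struct and the
helper functions are hypothetical, and eviction, memory resizing and
barriers are left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical merged view of the two sizes kept in shared memory. */
typedef struct PoolSizes
{
    atomic_int  activeNBuffers;     /* allocations come from [0, active) */
    atomic_int  currentNBuffers;    /* buffers [0, current) are valid */
} PoolSizes;

static void
pool_init(PoolSizes *p, int nbuffers)
{
    atomic_init(&p->activeNBuffers, nbuffers);
    atomic_init(&p->currentNBuffers, nbuffers);
}

/* Shrink, step 1: stop allocating from buffers beyond the new size. */
static void
shrink_begin(PoolSizes *p, int new_size)
{
    atomic_store(&p->activeNBuffers, new_size);
}

/* Shrink, step 2: call only after buffers beyond the new size are evicted. */
static void
shrink_finish(PoolSizes *p)
{
    atomic_store(&p->currentNBuffers, atomic_load(&p->activeNBuffers));
}

/* Expand: call only after memory is resized and new structures are set up. */
static void
expand_finish(PoolSizes *p, int new_size)
{
    atomic_store(&p->currentNBuffers, new_size);
    atomic_store(&p->activeNBuffers, new_size);
}

/* May a new page be placed in this buffer right now? */
static bool
buffer_is_active(PoolSizes *p, int buf_id)
{
    return buf_id >= 0 && buf_id < atomic_load(&p->activeNBuffers);
}

/* May this buffer still hold a page that others can find? */
static bool
buffer_is_valid(PoolSizes *p, int buf_id)
{
    return buf_id >= 0 && buf_id < atomic_load(&p->currentNBuffers);
}
```

Between shrink_begin() and shrink_finish() a buffer beyond the new size
is valid but not active, which is the state a single NBuffers value
cannot express.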

Next I will be working on:
1. Background writer synchronization
2. Checkpoint synchronization
3. Make all the shared buffer pool structures, except buffer blocks,
static and maximally allocated as suggested by Andres earlier. [3]
4. Replace NBuffers usages as explained above
5. Merge ShmemCtrl and StrategyControl as explained above
6. Handle failures in resizing
7. There have been concerns raised earlier that anonymous file backed
memory is not dumped with core. I am thinking of not using an
anonymous file for the main memory segment so that it gets dumped with
core. But shared buffers still will not be dumped. However, I am
skeptical as to whether we need GBs (say) of shared buffers being
dumped along with core, or whether we should leave that choice to
users.

[1] https://www.postgresql.org/message-id/CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BhUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u%40yaogvphfwc4h

--
Best Wishes,
Ashutosh Bapat

Attachment Content-Type Size
0001-Add-a-view-to-read-contents-of-shared-buffe-20251114.patch text/x-patch 14.7 KB
0004-WIP-test-shared-buffers-resizing-and-checkp-20251114.patch text/x-patch 11.2 KB
0003-Allow-to-resize-shared-memory-without-resta-20251114.patch text/x-patch 138.9 KB
0002-Memory-and-address-space-management-for-buf-20251114.patch text/x-patch 70.1 KB
mfdtruncate.c text/x-csrc 2.1 KB
