From: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
---|---|
To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Jack Ng <Jack(dot)Ng(at)huawei(dot)com>, Ni Ku <jakkuniku(at)gmail(dot)com> |
Subject: | Re: Changing shared_buffers without restart |
Date: | 2025-06-20 10:19:31 |
Message-ID: | my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2 |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
> On Mon, Apr 21, 2025 at 7:47 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> wrote:
>
> One thing I'm still wondering about is whether you really need all
> this multi-phase barrier stuff, or even need to stop other backends
> from running at all while doing the resize. I guess that's related to
> your remapping scheme, but supposing you find the simple
> ftruncate()-only approach to be good, my next question is: why isn't
> it enough to wait for all backends to agree to stop allocating new
> buffers in the range to be truncated, and then left them continue to
> run as normal? As far as they would be concerned, the in-progress
> downsize has already happened, though it could be reverted later if
> the eviction phase fails. Then the coordinator could start evicting
> buffers and truncating the shared memory object, which are
> phases/steps, sure, but it's not clear to me why they need other
> backends' help.
My intention behind keeping all backends waiting was to have a simple way of
not only preventing them from allocating new buffers from the truncated range,
but also eliminating any chance of them accessing those to-be-truncated
buffers. In the end it's just easier (at least for me) to reason about
correctness of the implementation this way.
> On Tue, Jun 10, 2025 at 04:39:58PM +0530, Ashutosh Bapat wrote:
>
> Here's patchset rebased on f85f6ab051b7cf6950247e5fa6072c4130613555
Thanks! I've reworked the series to implement approach suggested by
Thomas, and applied your patches to support buffers shrinking on top. I
had to restructure the patch set, here is how it looks like right now:
1. Preparation patches
Changes, that are needed to support resizing functionality, but not
strictly related to it.
* Process config reload in AIO workers. Corrects omission discussed on [1].
* Introduce pending flag for GUC assign hooks. Allowing to decouple a GUC value
change from actually applying it, sort of "pending" change. The idea is to let
a custom logic be triggered on an assign hook, and then take responsibility for
what happens later and how it's going to be applied. Doesn't do GUC reporting
yet.
* Introduce pss_barrierReceivedGeneration. Allows to distinguish situations
when a signal was processed everywhere, and when a signal was received
everywhere.
2. Resizing implementation
* Allow to use multiple shared memory mappings. A preparation patch, extending
the existing interface to support multiple shared memory segments.
* Address space reservation for shared memory. Implement the new way of
handling shared memory segments, now each segment can visually be represented
as following:
/ Address space \
+---------------<+>--------------------------+
| Actual content | Address space reservation |
| (memfd) | (mmap, PROT_NONE) |
+---------------<+>--------------------------+
The actual segment size is managed via ftruncate and mprotect. One interesting
side effect I haven't fully understood yet, is that Linux doesn't seem to
extend the existing mapping when doing mprotect on huge pages, it creates
another mapping instead. E.g. when using normal page size and resizing shared
memory we get:
7f4808600000-7f4817e00000 rw-s /memfd:buffers (deleted)
7f4817e00000-7f48a2000000 ---s /memfd:buffers (deleted)
Doing the same with huge pages ends up looking like this:
7f4808600000-7f4817e00000 rw-s /memfd:buffers (deleted)
7f4817e00000-7f4830000000 rw-s /memfd:buffers (deleted)
7f4830000000-7f48a2000000 ---s /memfd:buffers (deleted)
I'm still investigating whether it's a mistake on my side or a genuine Linux
behavior. At the same time I don't see it as a large issue, the same situation
could happen with the previous implementation as well.
* Introduce multiple shmem segments for shared buffers. Modifies necessary bits
to use new functionality.
* Allow to resize shared memory without restart. Utilizes infrastructure
introduced so far to implement stop-the-world resizing approach, where all the
active backend (and potentially new one spawning) are waiting until everyone
gets the same shared memory size.
When testing I've noticed that there seems to be concurrency issues with
interrupts, where aio workers and checkpointer sometimes do not receive
the resize signal correctly. I assume it has something to do with the
significant behavior change -- config reload processing can now fire
signals on its own. Letting those backends to always process config
reload first seems to be resolved (or at least hide) the issue, but I
still need to understand what's going on there.
3. Shared memory shrinking
So far only shared memory increase was implemented. These patches from Ashutosh
support shrinking as well, which is tricky due to the need for buffer eviction.
* Support shrinking shared buffers
* Reinitialize StrategyControl after resizing buffers
* Additional validation for buffer in the ring
> 0009 adds support to shrink shared buffers. It has two changes: a.
> evict the buffers outside the new buffer size b. remove buffers with
> buffer id outside the new buffer size from the free list. If a buffer
> being evicted is pinned, the operation is aborted and a FATAL error is
> raised. I think we need to change this behaviour to be less severe
> like rolling back the operation or waiting for the pinned buffer to be
> unpinned etc. Better even if we could let users control the behaviour.
> But we need better infrastructure to do such things. That's one TODO
> left in the patch.
I haven't reviewed those, just tested a bit to finally include into the series.
Note that I had to tweak two things:
* The way it was originally implemented was sending resize signal to postmaster
before doing eviction, which could result in sigbus when accessing LSN of a
dirty buffer to be evicted. I've reshaped it a bit to make sure eviction always
happens first.
* It seems the CurrentResource owner could be missing sometimes, so I've added
a band-aid checking its presence.
One side note, during my testing I've noticed assert failures on
pgstat_tracks_io_op inside a wal writer a few times. I couldn't reproduce it
after the fixes above, but still it may indicate that something is off. E.g.
it's somehow not expected that the wal writer will do buffer eviction IO (from
what I understand, the current shrinking implementation allows that).
> Buffer lookup table resizing
> ------------------------------------
> The size of the buffer lookup table depends upon (number of shared
> buffers + number of partitions in the shared buffer lookup table). If we
> shrink the buffer pool, the buffer lookup table will become sparse but
> still useful. If we expand the buffers we need to expand the buffer lookup
> table too. That's not implemented in the current patchset.
Just FYI, buffer lookup table has its own STRATEGY_SHMEM_SEGMENT shared memory
segment and is resized in the same way as others. There could be lots of
details missing, but at least the corresponding resizable segment is already
there.
Attachment | Content-Type | Size |
---|---|---|
v5-0001-Process-config-reload-in-AIO-workers.patch | text/plain | 1.8 KB |
v5-0002-Introduce-pending-flag-for-GUC-assign-hooks.patch | text/plain | 12.8 KB |
v5-0003-Introduce-pss_barrierReceivedGeneration.patch | text/plain | 7.3 KB |
v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch | text/plain | 30.5 KB |
v5-0005-Address-space-reservation-for-shared-memory.patch | text/plain | 24.2 KB |
v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch | text/plain | 11.7 KB |
v5-0007-Allow-to-resize-shared-memory-without-restart.patch | text/plain | 41.3 KB |
v5-0008-Support-shrinking-shared-buffers.patch | text/plain | 14.0 KB |
v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch | text/plain | 16.8 KB |
v5-0010-Additional-validation-for-buffer-in-the-ring.patch | text/plain | 2.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Dmitry Dolgov | 2025-06-20 10:22:42 | Re: Changing shared_buffers without restart |
Previous Message | Ivan Kush | 2025-06-20 10:08:35 | Re: [PoC] Federated Authn/z with OAUTHBEARER |