From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Changing shared_buffers without restart |
Date: | 2025-09-18 04:55:29 |
Message-ID: | CAExHW5vB8sAmDtkEN5dcYYeBok3D8eAzMFCOH1k+krxht1yFjA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jun 16, 2025 at 6:09 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> >
> > Buffer lookup table resizing
> > ------------------------------------
I looked at the interaction of the shared buffer lookup table with buffer
resizing as per the patches in [0]. Here's a list of my findings, issues
and fixes.
1. The basic structure of buffer lookup table (directory and control
area etc.) is allocated in a shared memory segment dedicated to the
buffer lookup table. However, the entries are allocated in the shared
memory using ShmemAllocNoError() which allocates the entries in the
main memory segment. In order for ShmemAllocNoError() to allocate
entries in the dedicated shared memory segment, it should know the
shared memory segment. We could do that by setting the segment number
in element_alloc() before calling hashp->alloc(). This is similar to
how ShmemAllocNoError() knows the memory context in which to allocate
the entries on heap. But read on ...
2. When the buffer pool is expanded, an "out of shared memory" error
is thrown when more entries are added to the buffer lookup table. We
could temporarily adjust that flag and allocate more entries. But the
directory also needs to be expanded proportionately, otherwise it may
lead to more contention. Expanding the directory is non-trivial since
it's a contiguous chunk of memory, followed by other data structures.
Further, expanding directory would require rehashing all the existing
entries, which may impact the time taken by the resizing operation and
how long other backends remain blocked.
3. When the buffer pool is shrunk, there is no way to free the extra
entries in such a way that a contiguous chunk of shared memory can be
given back to the OS. If we implement it, we will need some way to
compact the remaining entries into a contiguous chunk of memory and
unmap the rest. That's a significant amount of code.
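To illustrate point 1, here is a rough sketch of how the allocator could be told which segment to use just before the allocation callback runs. It is a toy bump allocator, not PostgreSQL code: SketchSegment, SketchHash, sketch_current_segment, sketch_shmem_alloc() and sketch_element_alloc() are all hypothetical names, alignment is ignored, and the real ShmemAllocNoError() and element_alloc() are more involved.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NSEGS 2
#define SKETCH_SEGMENT_BYTES 1024

/* Hypothetical stand-in for one shared memory segment: a bump allocator. */
typedef struct SketchSegment
{
    unsigned char mem[SKETCH_SEGMENT_BYTES];
    size_t      used;
} SketchSegment;

static SketchSegment sketch_segments[SKETCH_NSEGS];

/* Segment the next allocation comes from; 0 plays the main segment. */
static int  sketch_current_segment = 0;

/* Allocation callback: returns NULL instead of raising an error when full. */
static void *
sketch_shmem_alloc(size_t size)
{
    SketchSegment *seg = &sketch_segments[sketch_current_segment];
    void       *ptr;

    if (size > SKETCH_SEGMENT_BYTES - seg->used)
        return NULL;
    ptr = seg->mem + seg->used;
    seg->used += size;
    return ptr;
}

/* Hypothetical hash table header remembering its segment and allocator. */
typedef struct SketchHash
{
    int         segment;
    void       *(*alloc) (size_t size);
} SketchHash;

/* Select the table's segment, call the callback, then restore the default. */
static void *
sketch_element_alloc(SketchHash *hashp, size_t nelem, size_t elem_size)
{
    int         saved = sketch_current_segment;
    void       *ptr;

    assert(hashp->segment >= 0 && hashp->segment < SKETCH_NSEGS);
    sketch_current_segment = hashp->segment;
    ptr = hashp->alloc(nelem * elem_size);
    sketch_current_segment = saved;
    return ptr;
}
```

With a table whose segment is 1, the entries land in segment 1 and the main segment stays untouched. The proposal below avoids the need for any of this.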
Given these things, I think we should set up the buffer lookup table
right at the beginning to hold the maximum number of entries required
to expand the buffer pool to its maximum. The maximum size to which the
buffer pool can grow is given by the GUC max_available_memory (which is
a misnomer and should be renamed to max_shared_buffers or something),
introduced by the previous set of patches [0]. We don't shrink or expand
the buffer lookup table as we shrink and expand the buffer pool. With
that, the buffer lookup table can be located in the main memory segment
itself and we don't have to fix ShmemAllocNoError().
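A minimal sketch of this arrangement, assuming the table capacity is derived once from the maximum buffer count and never touched again. SketchBufferPool, sketch_pool_init() and sketch_pool_resize() are hypothetical names, and SKETCH_NUM_PARTITIONS only stands in for whatever slack the real table sizing adds; none of this is taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed slack added to the entry count; a placeholder value. */
#define SKETCH_NUM_PARTITIONS 128

typedef struct SketchBufferPool
{
    size_t      nbuffers;       /* current pool size, changes on resize */
    size_t      max_buffers;    /* upper bound, fixed at startup */
    size_t      table_capacity; /* lookup table entries, fixed at startup */
} SketchBufferPool;

/* Size the lookup table for the largest pool we may ever grow to. */
static void
sketch_pool_init(SketchBufferPool *pool, size_t nbuffers, size_t max_buffers)
{
    assert(nbuffers <= max_buffers);
    pool->nbuffers = nbuffers;
    pool->max_buffers = max_buffers;
    pool->table_capacity = max_buffers + SKETCH_NUM_PARTITIONS;
}

/* Resizing changes only the pool; the lookup table is left alone. */
static bool
sketch_pool_resize(SketchBufferPool *pool, size_t new_nbuffers)
{
    if (new_nbuffers == 0 || new_nbuffers > pool->max_buffers)
        return false;
    pool->nbuffers = new_nbuffers;
    return true;
}
```

The point of the sketch is that a resize never needs to allocate, free or rehash lookup table entries; the cost is that the table is sized for max_buffers even when nbuffers is much smaller.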
This has two side effects:
1. A larger hash table makes hash table operations slower [2]. Its
impact on actual queries needs to be studied.
2. There's an increase in the total shared memory allocated upfront.
Currently we allocate 150MB memory with all default GUC values. With
this change we will allocate 250MB memory since max_available_memory
(or rather max_shared_buffers) defaults to allow 524288 shared
buffers. If we make max_shared_buffers default to shared_buffers,
it won't be a problem. However, when users set max_shared_buffers
themselves, they have to be conscious of the fact that it will
allocate more memory than necessary for the given shared_buffers value.
This fix is part of patch 0015.
The patchset contains more fixes and improvements as described below.
Per TODO in the prologue of CalculateShmemSize(), more than necessary
shared memory was mapped and allocated in the buffer manager related
memory segments because of an error in that function; the amount of
memory to be allocated in the main shared memory segment was added to
every other shared memory segment. Thus shrinking those memory
segments didn't actually affect the objects allocated in those.
Because of that, we were not seeing SIGBUS even when the objects
supposedly shrunk were accessed, masking bugs in the patches. In this
patchset I have a working fix for CalculateShmemSize(). With that fix
in place we see server crashing with SIGBUS in some resizing
operations. Those cases need to be investigated. The fix changes its
minions to a. return the size of shared memory objects to be allocated
in the main memory segment and b. add the sizes of the shared memory
objects to be allocated in other memory segments to the respective
AnonymousMapping structures. This asymmetry between the main segment
and the other segments exists to avoid changing the minions of
CalculateShmemSize() too much. But I think we should eliminate the
asymmetry and change every minion to add sizes in the respective
segment's AnonymousMapping structure. The patch proposed at [3] would
simplify CalculateShmemSize(), which should help eliminate the asymmetry.
Along with refactoring CalculateShmemSize() I have added small fixes
to update the total size and end address of shared memory mapping
after resizing them and also to update the new allocated_sizes of
resized structures in ShmemIndex entry. Patch 0009 includes these
changes.
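The accounting error and the symmetric variant I am suggesting can be sketched as follows. This is only an illustration: the segment identifiers, SketchMapping, sketch_request() and the two sketch_calc_*() functions are hypothetical and much simpler than CalculateShmemSize() and the AnonymousMapping structure in the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment identifiers; the real set in the patches may differ. */
enum
{
    SKETCH_SEG_MAIN,
    SKETCH_SEG_BUFFERS,
    SKETCH_SEG_DESCRIPTORS,
    SKETCH_SEG_COUNT
};

/* Hypothetical per-segment accounting record. */
typedef struct SketchMapping
{
    size_t      shmem_size;
} SketchMapping;

/* A "minion": adds its requirement to the segment that will hold it. */
static void
sketch_request(SketchMapping *maps, int segment, size_t size)
{
    assert(segment >= 0 && segment < SKETCH_SEG_COUNT);
    maps[segment].shmem_size += size;
}

/* The error: the main segment's total is counted in every segment. */
static void
sketch_calc_buggy(SketchMapping *maps, size_t main_size,
                  size_t buffers_size, size_t desc_size)
{
    for (int i = 0; i < SKETCH_SEG_COUNT; i++)
        maps[i].shmem_size = main_size;
    sketch_request(maps, SKETCH_SEG_BUFFERS, buffers_size);
    sketch_request(maps, SKETCH_SEG_DESCRIPTORS, desc_size);
}

/* Symmetric variant: every requirement is counted once, in its own segment. */
static void
sketch_calc_symmetric(SketchMapping *maps, size_t main_size,
                      size_t buffers_size, size_t desc_size)
{
    for (int i = 0; i < SKETCH_SEG_COUNT; i++)
        maps[i].shmem_size = 0;
    sketch_request(maps, SKETCH_SEG_MAIN, main_size);
    sketch_request(maps, SKETCH_SEG_BUFFERS, buffers_size);
    sketch_request(maps, SKETCH_SEG_DESCRIPTORS, desc_size);
}
```

In the buggy variant each non-main segment carries main_size bytes of slack, so shrinking it by less than that amount leaves its objects mapped, which is how the missing SIGBUS was hidden.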
I found that the shared memory resizing synchronization is triggered
even before setting up the shared buffers the first time after
starting the server. That's not required and also can lead to issues
because of trying to resize shared buffers which do not exist. A WIP
fix is included as patch 0012. A TODO in the patch needs to be
addressed. It should be squashed into an earlier patch 0011 when
appropriate.
While debugging the above mentioned issues, I found it useful to have
an insight into the contents of buffer lookup table. Hence I added a
system view exposing the contents of the buffer lookup table. This is
added as patch 0001 in the attached patchset. I think it's useful to
have this independent of this patchset to investigate inconsistencies
between the contents of shared buffer pool and buffer lookup table.
Again for debugging purposes, I have added a new column "segment" in
pg_shmem_allocations reporting the shared memory segment in which the
given allocation has happened. I have also added another view
pg_shmem_segments to provide information about the shared memory
segments. This view definition will change as we design shared memory
mappings and shared memory segments better. So it's WIP and needs doc
changes as well. I have included it in the patchset as patch 0011
since it will be helpful to debug issues found in the patch when
testing. The patch should be merged into patch 0007.
Last but not least, patch 0016 contains two tests: a. a stress test
that runs buffer resizing while pgbench is running, b. a SQL test that
checks the sizes of segments and shared memory allocations after resizing.
The stress test polls "show shared_buffers" output to know when the
resizing is finished. I think we need a better interface to know when
resizing has finished. Thanks a lot to my colleague Palak Chaturvedi for
providing the initial draft of the test case.
The patches are rebased on top of the latest master, which includes
changes to remove the free buffer list. That led to removing all the
code in these patches dealing with the free buffer list.
I am intentionally keeping my changes (patches 0001, 0008 to 0012,
0012 to 0016) separate from Dmitry's changes so that Dmitry can review
them easily. The patches are arranged so that each of my patches sits
near the patch of Dmitry's into which it should be squashed.
Dmitry,
I found that max_available_memory is PGC_SIGHUP. Is that intentional?
I thought it would be PGC_POSTMASTER since we cannot reserve more
address space without restarting the postmaster. Left a TODO for this.
I think we also need to change the name and description to better
reflect its actual functionality.
[0] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
[1] https://www.postgresql.org/message-id/CAExHW5v0jh3F_wj86yC%3DqBfWk0uiT94qy%3DZ41uzAHLHh0SerRA%40mail.gmail.com
[2] https://ashutoshpg.blogspot.com/2025/07/efficiency-of-sparse-hash-table.html
[3] https://commitfest.postgresql.org/patch/5997/
--
Best Wishes,
Ashutosh Bapat