From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Changing shared_buffers without restart |
Date: | 2025-06-10 11:09:58 |
Message-ID: | CAExHW5v9cE+ETusTafZyvy+eVpnVvoayVm7ZOk4Ddq8fxY270A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Apr 21, 2025 at 7:47 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
> wrote:
> > Yeah, that would work and will allow to avoid MAP_FIXED and mremap,
> which are
> > questionable from portability point of view. This leaves memfd_create,
> and I'm
> > still not completely clear on it's portability -- it seems to be
> specific to
> > Linux, but others provide compatible implementation as well.
>
> Something like this should work, roughly based on DSM code except here
> we don't really need the name so we unlink it immediately, at the
> slight risk of leaking it if the postmaster is killed between those
> lines (maybe someone should go and tell POSIX to support the special
> name SHM_ANON or some other way to avoid that; I can't see any
> portable workaround). Not tested/compiled, just a sketch:
>
> #ifdef HAVE_MEMFD_CREATE
> /* Anonymous shared memory region. */
> fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
> #else
> /* Standard POSIX insists on a name, which we unlink immediately. */
> do
> {
> char tmp[80];
> snprintf(tmp, sizeof(tmp), "PostgreSQL.%u",
> pg_prng_uint32(&pg_global_prng_state));
> fd.= shm_open(tmp, O_CREAT | O_EXCL);
> if (fd >= 0)
> shm_unlink(tmp);
> } while (fd < 0 && errno == EXIST);
> #endif
>
> > Let me experiment with this idea a bit, I would like to make sure there
> are no
> > other limitations we might face.
>
> One thing I'm still wondering about is whether you really need all
> this multi-phase barrier stuff, or even need to stop other backends
> from running at all while doing the resize. I guess that's related to
> your remapping scheme, but supposing you find the simple
> ftruncate()-only approach to be good, my next question is: why isn't
> it enough to wait for all backends to agree to stop allocating new
> buffers in the range to be truncated, and then left them continue to
> run as normal? As far as they would be concerned, the in-progress
> downsize has already happened, though it could be reverted later if
> the eviction phase fails. Then the coordinator could start evicting
> buffers and truncating the shared memory object, which are
> phases/steps, sure, but it's not clear to me why they need other
> backends' help.
>
AFAIU, we required the phased approach since mremap needed to happen in
every backend after buffer eviction but before making modifications to the
shared memory. If we don't need to call mremap in every backend and just
ftruncate + initializing memory (when expanding buffers) is enough, I think
phased approach isn't needed. But I haven't tried it myself.
Here's patchset rebased on 3feff3916ee106c084eca848527dc2d2c3ef4e89.
0001 - 0008 are same as the previous patchset
0009 adds support to shrink shared buffers. It has two changes: a. evict
the buffers outside the new buffer size b. remove buffers with buffer id
outside the new buffer size from the free list. If a buffer being evicted
is pinned, the operation is aborted and a FATAL error is raised. I think we
need to change this behaviour to be less severe like rolling back the
operation or waiting for the pinned buffer to be unpinned etc. Better even
if we could let users control the behaviour. But we need better
infrastructure to do such things. That's one TODO left in the patch.
0010 is about reinitializing the Strategy reinitialization. Once we expand
the buffers, the new buffers need to be added to the free list. Some
StrategyControl area members (not all) need to be adjusted. That's what
this patch does. But a deeper adjustment in BgBufferSync() and
ClockSweepTick() is required. Further we need to do something about the
buffer lookup table. More on that later in the email.
0011-0012 fix compilation issues in these patches but those fixes are not
correct. The patches are there so that binaries can be built without any
compilation issues and someone can experiment with buffer resizing. Good
thing is the compilation fixes are in SQL callable functions
pg_get_shmem_pagesize() and pg_get_shmem_numa(). So there's no ill-effect
because of these patches as long as those two functions are not called.
Buffer lookup table resizing
------------------------------------
The size of the buffer lookup table depends upon (number of shared
buffers + number of partitions in the shared buffer lookup table). If we
shrink the buffer pool, the buffer lookup table will become sparse but
still useful. If we expand the buffers we need to expand the buffer lookup
table too. That's not implemented in the current patchset. There are two
solutions here:
1. We map a lot of extra address space (not memory) initially to
accomodate for future expansion of shared buffer pool. Let's say that the
total address space is sufficient to accomodate Nx buffers. Simple solution
is to allocate a buffer lookup table with Nx initial entries so that we
don't have to resize the buffer lookup table ever. It will waste memory but
we might be ok with that as version 1 solution. According to my offline
discussion with David Rowley, buffer lookups in sparse hash tables are
inefficient because or more cacheline faults. Whether that translates to
any noticeable performance degradation in TPS needs to be measured.
2. Alternate solution is to resize the buffer mapping table as well. This
means that we rehash all the entries again which may take a longer time and
the partitions will remain locked for that amount of time. Not to mention
this will require non-trivial change to dynahash implementation.
Next I will look at BgBufferSync() and ClockSweepTick() adjustments and
then buffer lookup table fix with approach 1.
--
Best Wishes,
Ashutosh Bapat
From | Date | Subject | |
---|---|---|---|
Next Message | Florents Tselai | 2025-06-10 11:36:53 | Re: Feature: psql - display current search_path in prompt |
Previous Message | Michael Banck | 2025-06-10 10:59:16 | Re: Cleaning up historical portability baggage |