Quick Links

Re: Changing shared_buffers without restart

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: Changing shared_buffers without restart
Date:	2025-09-18 13:52:03
Message-ID:	qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u@yaogvphfwc4h
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:
> From d1ed934ccd02fca2c831e582b07a169e17d19f59 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Tue, 17 Jun 2025 15:14:33 +0200
> Subject: [PATCH 02/16] Process config reload in AIO workers

I think this is superfluous due to b8e1f2d96bb9

> Currenly AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
> which does not include ConfigReloadPending. Thus we need to check for it
> explicitly.

> +/*
> + * Process any new interrupts.
> + */
> +static void
> +pgaio_worker_process_interrupts(void)
> +{
> + /*
> + * Reloading config can trigger further signals, complicating interrupts
> + * processing -- so let it run first.
> + *
> + * XXX: Is there any need in memory barrier after ProcessConfigFile?
> + */
> + if (ConfigReloadPending)
> + {
> + ConfigReloadPending = false;
> + ProcessConfigFile(PGC_SIGHUP);
> + }
> +
> + if (ProcSignalBarrierPending)
> + ProcessProcSignalBarrier();
> +}

Given that even before b8e1f2d96bb9 method_worker.c used
CHECK_FOR_INTERRUPTS(), which contains a ProcessProcSignalBarrier(), I don't
know why that second check was added here?

> From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Sun, 6 Apr 2025 16:40:32 +0200
> Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
>
> Currently an assing hook can perform some preprocessing of a new value,
> but it cannot change the behavior, which dictates that the new value
> will be applied immediately after the hook. Certain GUC options (like
> shared_buffers, coming in subsequent patches) may need coordinating work
> between backends to change, meaning we cannot apply it right away.
>
> Add a new flag "pending" for an assign hook to allow the hook indicate
> exactly that. If the pending flag is set after the hook, the new value
> will not be applied and it's handling becomes the hook's implementation
> responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better
to just use the GUC value as the desired "target" configuration and have a
function or a show-only GUC for reporting the current size.

I don't think you can't just block application of the GUC until the resize is
complete. E.g. what if the value was too big and the new configuration needs
to fixed to be lower?

> From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Fri, 4 Apr 2025 21:46:14 +0200
> Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
>
> Currently WaitForProcSignalBarrier allows to make sure the message sent
> via EmitProcSignalBarrier was processed by all ProcSignal mechanism
> participants.
>
> Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
> which will be updated when a process has received the message, but not
> processed it yet. This makes it possible to support a new mode of
> waiting, when ProcSignal participants want to synchronize message
> processing. To do that, a participant can wait via
> WaitForProcSignalBarrierReceived when processing a message, effectively
> making sure that all processes are going to start processing
> ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same
event, can really be called "online". There can be significant delays in
processing a barrier, stalling the entire server until that is reached seems
like a complete no-go for production systems?

> From 63fe27340656c52b13f4eecebd9e73d24efe5e33 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Fri, 28 Feb 2025 19:54:47 +0100
> Subject: [PATCH 05/16] Allow to use multiple shared memory mappings
>
> Currently all the work with shared memory is done via a single anonymous
> memory mapping, which limits ways how the shared memory could be organized.
>
> Introduce possibility to allocate multiple shared memory mappings, where
> a single mapping is associated with a specified shared memory segment.
> There is only fixed amount of available segments, currently only one
> main shared memory segment is allocated. A new shared memory API is
> introduces, extended with a segment as a new parameter. As a path of
> least resistance, the original API is kept in place, utilizing the main
> shared memory segment.

> -#define MAX_ON_EXITS 20
> +#define MAX_ON_EXITS 40

Why does a patch like this contain changes like this mixed in with the rest?
That's clearly not directly related to $subject.

> /* shared memory global variables */
>
> -static PGShmemHeader *ShmemSegHdr; /* shared mem segment header */
> +ShmemSegment Segments[ANON_MAPPINGS];
>
> -static void *ShmemBase; /* start address of shared memory */
> -
> -static void *ShmemEnd; /* end+1 address of shared memory */
> -
> -slock_t *ShmemLock; /* spinlock for shared memory and LWLock
> - * allocation */
> -
> -static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */

Why do we need a separate ShmemLock for each segment? Besides being
unnecessary, it seems like that prevents locking in a way that provides
consistency across all segments.

> From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Tue, 17 Jun 2025 11:47:04 +0200
> Subject: [PATCH 06/16] Address space reservation for shared memory
>
> Currently the shared memory layout is designed to pack everything tight
> together, leaving no space between mappings for resizing. Here is how it
> looks like for one mapping in /proc/$PID/maps, /dev/zero represents the
> anonymous shared memory we talk about:
>
> 00400000-00490000 /path/bin/postgres
> ...
> 012d9000-0133e000 [heap]
> 7f443a800000-7f470a800000 /dev/zero (deleted)
> 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
> 7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
> ...
>
> Make the layout more dynamic via splitting every shared memory segment
> into two parts:
>
> * An anonymous file, which actually contains shared memory content. Such
> an anonymous file is created via memfd_create, it lives in memory,
> behaves like a regular file and semantically equivalent to an
> anonymous memory allocated via mmap with MAP_ANONYMOUS.
>
> * A reservation mapping, which size is much larger than required shared
> segment size. This mapping is created with flags PROT_NONE (which
> makes sure the reserved space is not used), and MAP_NORESERVE (to not
> count the reserved space against memory limits). The anonymous file is
> mapped into this reservation mapping.

The commit message fails to explain why, if we're already relying on
MAP_NORESERVE, we need to anything else? Why can't we just have one maximally
sized allocation that's marked MAP_NORESERVE for all the parts that we don't
yet need?

> There are also few unrelated advantages of using anon files:
>
> * We've got a file descriptor, which could be used for regular file
> operations (modification, truncation, you name it).

What is this an advantage for?

> * The file could be given a name, which improves readability when it
> comes to process maps.

> * By default, Linux will not add file-backed shared mappings into a core dump,
> making it more convenient to work with them in PostgreSQL: no more huge dumps
> to process.

That's just as well a downside, because you now can't investigate some
issues. This was already configurable via coredump_filter.

> From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Tue, 17 Jun 2025 11:22:02 +0200
> Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
>
> Add more shmem segments to split shared buffers into following chunks:
> * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
> * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
> * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
> * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
> * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

Why do all these need to be separate segments? Afaict we'll have to maximally
size everything other than BUFFERS_SHMEM_SEGMENT at start?

> From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> Date: Tue, 17 Jun 2025 14:16:55 +0200
> Subject: [PATCH 11/16] Allow to resize shared memory without restart
>
> Add assing hook for shared_buffers to resize shared memory using space,
> introduced in the previous commits without requiring PostgreSQL restart.
> Essentially the implementation is based on two mechanisms: a
> ProcSignalBarrier is used to make sure all processes are starting the
> resize procedure simultaneously, and a global Barrier is used to
> coordinate after that and make sure all finished processes are waiting
> for others that are in progress.
>
> The resize process looks like this:
>
> * The GUC assign hook sets a flag to let the Postmaster know that resize
> was requested.
>
> * Postmaster verifies the flag in the event loop, and starts the resize
> by emitting a ProcSignal barrier.
>
> * All processes, that participate in ProcSignal mechanism, begin to
> process ProcSignal barrier. First a process waits until all processes
> have confirmed they received the message and can start simultaneously.

As mentioned above, this basically makes the entire feature not really
online. Besides the latency of some processes not getting to the barrier
immediately, there's also the issue that actually reserving large amounts of
memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a
"temporary" restriction, but the whole design seems to be fairly centered
around that.

> * Every process recalculates shared memory size based on the new
> NBuffers, adjusts its size using ftruncate and adjust reservation
> permissions with mprotect. One elected process signals the postmaster
> to do the same.

If we just used a single memory mapping with all unused parts marked
MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
work in this patchset)..

> From experiment it turns out that shared mappings have to be extended
> separately for each process that uses them. Another rough edge is that a
> backend blocked on ReadCommand will not apply shared_buffers change
> until it receives something.

That's not a rough edge, that basically makes the feature unusable, no?

> +-- Test 2: Set to 64MB
> +ALTER SYSTEM SET shared_buffers = '64MB';
> +SELECT pg_reload_conf();
> +SELECT pg_sleep(1);
> +SHOW shared_buffers;

Tests containing sleeps are a significant warning flag imo.

Greetings,

Andres Freund

In response to

Re: Changing shared_buffers without restart at 2025-09-18 04:55:29 from Ashutosh Bapat

Responses

Re: Changing shared_buffers without restart at 2025-09-18 14:05:01 from Andres Freund
Re: Changing shared_buffers without restart at 2025-10-13 15:58:09 from Ashutosh Bapat

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fujii Masao	2025-09-18 13:54:46	Re: Invalid primary_slot_name triggers warnings in all processes on reload
Previous Message	Mark Dilger	2025-09-18 13:36:02	Re: [PATCH] Check that index can return in get_actual_variable_range()