Re: Changing shared_buffers without restart

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Changing shared_buffers without restart
Date: 2025-09-26 18:04:21
Message-ID: yjr3akfujpbsmihfsr3dge4yhlb7xnr3kqp7o4fl4e7mop2vc6@qnzr75xx5o5a
Lists: pgsql-hackers

Sorry for the late reply, folks.

> On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:
> > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Sun, 6 Apr 2025 16:40:32 +0200
> > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
> >
> > Currently an assign hook can perform some preprocessing of a new
> > value, but it cannot change the rule that the new value is applied
> > immediately after the hook returns. Certain GUC options (like
> > shared_buffers, coming in subsequent patches) need coordinated work
> > between backends to change, meaning the new value cannot be applied
> > right away.
> >
> > Add a new flag "pending" to let an assign hook indicate exactly that.
> > If the pending flag is set after the hook, the new value will not be
> > applied and its handling becomes the responsibility of the hook's
> > implementation.
>
> I doubt it makes sense to add this to the GUC system. I think it'd be better
> to just use the GUC value as the desired "target" configuration and have a
> function or a show-only GUC for reporting the current size.
>
> I don't think you can just block application of the GUC until the resize is
> complete. E.g. what if the value was too big and the new configuration needs
> to be fixed to be lower?

I think it was a bit hasty of me to post another version of the patch
without the design changes we agreed upon last time. I'm still working
on that (sorry, it takes time; I haven't written this much Perl for
testing in ages), and to simplify the discussion the current
implementation doesn't touch the GUC system at all. I'm still convinced
that multi-step GUC changes make sense, but they have proven to be more
complicated than I anticipated, so I'll spin up a separate thread to
discuss that when I get to it.
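
For reference, the idea in 0003 was roughly the following (a simplified
sketch, not the exact patch code; the extended hook signature is my
paraphrase of the commit message above):

    /* An assign hook gains a "pending" out parameter (hypothetical form). */
    typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);

    static void
    assign_shared_buffers(int newval, void *extra, bool *pending)
    {
        /*
         * The new value cannot be applied immediately, it requires
         * coordination between backends. Mark it pending: the GUC
         * machinery will not apply it, and applying it later becomes
         * the responsibility of the resize code.
         */
        *pending = true;
    }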

> > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Fri, 4 Apr 2025 21:46:14 +0200
> > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
> >
> > Currently WaitForProcSignalBarrier makes it possible to ensure that a
> > message sent via EmitProcSignalBarrier was processed by all ProcSignal
> > mechanism participants.
> >
> > Add pss_barrierReceivedGeneration alongside pss_barrierGeneration; it
> > is updated when a process has received the message but has not yet
> > processed it. This makes it possible to support a new mode of waiting,
> > in which ProcSignal participants synchronize message processing. To do
> > that, a participant can wait via WaitForProcSignalBarrierReceived when
> > processing a message, effectively making sure that all processes start
> > processing the ProcSignalBarrier simultaneously.
>
> I doubt that "online resizing" which requires synchronously processing the
> same event can really be called "online". There can be significant delays
> in processing a barrier; stalling the entire server until it is reached
> seems like a complete no-go for production systems?
>
> [...]

> As mentioned above, this basically makes the entire feature not really
> online. Besides the latency of some processes not getting to the barrier
> immediately, there's also the issue that actually reserving large amounts of
> memory can take a long time - during which all processes would be unavailable.
>
> I really don't see that being viable. It'd be one thing if that were a
> "temporary" restriction, but the whole design seems to be fairly centered
> around that.
>
> [...]
>
> Besides not really being online, isn't this a recipe for endless undetected
> deadlocks? What if process A waits for a lock held by process B and process B
> arrives at the barrier? Process A won't ever get there, because process B
> can't make progress, because A is not making progress.

Same as above: the version I'm working on right now changes this in
favor of an approach that looks more like the one from the "online
checksum change" patch. I've even stumbled upon cases where a process
was simply killed and never arrived at the barrier, so that was it. The
new approach makes certain parts simpler, but requires managing
backends that, for some interval, have different views of how large the
shared memory segments are. Introducing a new parameter, the "number of
available buffers", seems to address all the cases I've found so far.
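
To illustrate what I mean (purely a sketch, the names here are invented
and not from the patch):

    /*
     * While segments are being resized, backends may disagree on the
     * mapped size, so buffer usage is capped by a shared counter of
     * buffers known to be available to everyone.
     */
    static inline bool
    BufferIdIsAvailable(int buf_id)
    {
        uint32 available = pg_atomic_read_u32(&ShmemResizeCtl->num_available_buffers);

        return buf_id < (int) available;
    }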

Btw, under "online" resizing I mostly understood "without restart", the
goal was not to make it really "online".

> > -#define MAX_ON_EXITS 20
> > +#define MAX_ON_EXITS 40
>
> Why does a patch like this contain changes like this mixed in with the rest?
> That's clearly not directly related to $subject.

An artifact of rebasing; it belonged to 0007.

> > From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Tue, 17 Jun 2025 11:47:04 +0200
> > Subject: [PATCH 06/16] Address space reservation for shared memory
> >
> > Currently the shared memory layout is designed to pack everything
> > tightly together, leaving no space between mappings for resizing. Here
> > is how it looks for one mapping in /proc/$PID/maps; /dev/zero
> > represents the anonymous shared memory in question:
> >
> > 00400000-00490000 /path/bin/postgres
> > ...
> > 012d9000-0133e000 [heap]
> > 7f443a800000-7f470a800000 /dev/zero (deleted)
> > 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
> > 7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
> > ...
> >
> > Make the layout more dynamic by splitting every shared memory segment
> > into two parts:
> >
> > * An anonymous file, which contains the actual shared memory content.
> >   Such an anonymous file is created via memfd_create; it lives in
> >   memory, behaves like a regular file, and is semantically equivalent
> >   to anonymous memory allocated via mmap with MAP_ANONYMOUS.
> >
> > * A reservation mapping, whose size is much larger than the required
> >   shared segment size. This mapping is created with PROT_NONE (which
> >   makes sure the reserved space is not used) and MAP_NORESERVE (so the
> >   reserved space is not counted against memory limits). The anonymous
> >   file is mapped into this reservation mapping.
>
> The commit message fails to explain why, if we're already relying on
> MAP_NORESERVE, we need to do anything else. Why can't we just have one
> maximally sized allocation that's marked MAP_NORESERVE for all the parts
> that we don't yet need?

How do we return memory to the OS in that case? Currently it's done
explicitly by truncating the anonymous file.
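
To spell out the scheme from 0006 (a minimal standalone sketch with
error handling omitted; the sizes are arbitrary and this is not the
patch's actual code):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  reserved_size = (size_t) 8 * 1024 * 1024 * 1024;
        size_t  active_size = (size_t) 1 * 1024 * 1024 * 1024;
        void   *reserved;
        int     fd;

        /*
         * Reserve address space only: PROT_NONE makes it inaccessible,
         * MAP_NORESERVE keeps it from counting against overcommit.
         */
        reserved = mmap(NULL, reserved_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        /* The shared memory content lives in an in-memory file... */
        fd = memfd_create("main_shmem", 0);
        ftruncate(fd, (off_t) active_size);

        /* ...mapped over the beginning of the reservation. */
        mmap(reserved, active_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);

        /*
         * Shrinking returns memory to the OS by truncating the file
         * (the now-unbacked tail of the mapping would also have to be
         * re-protected, details omitted here).
         */
        ftruncate(fd, (off_t) active_size / 2);
        return 0;
    }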

> > * The file could be given a name, which improves readability when it
> > comes to process maps.
>
> > * By default, Linux will not add file-backed shared mappings into a core dump,
> > making it more convenient to work with them in PostgreSQL: no more huge dumps
> > to process.
>
> That's just as much a downside, because now you can't investigate some
> issues. This was already configurable via coredump_filter.

This behaviour is still configurable via coredump_filter; it's just
that the default value has effectively changed.

> > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6(at)gmail(dot)com>
> > Date: Tue, 17 Jun 2025 11:22:02 +0200
> > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
> >
> > Add more shmem segments to split shared buffers into following chunks:
> > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
> > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
> > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
> > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
> > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
>
> Why do all these need to be separate segments? Afaict we'll have to maximally
> size everything other than BUFFERS_SHMEM_SEGMENT at start?

Why would they need to be maxed out at the start? So far my rule of
thumb has been one segment per structure whose size depends on
NBuffers, so that when NBuffers changes each segment can be adjusted
independently.

> > +-- Test 2: Set to 64MB
> > +ALTER SYSTEM SET shared_buffers = '64MB';
> > +SELECT pg_reload_conf();
> > +SELECT pg_sleep(1);
> > +SHOW shared_buffers;
>
> Tests containing sleeps are a significant warning flag imo.

The tests I'm preparing avoid this so far by waiting on injection
points. I haven't found anything similar among the existing tests, but
I assume such an approach is fine.
