| From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, chaturvedipalak1911(at)gmail(dot)com |
| Subject: | Re: Changing shared_buffers without restart |
| Date: | 2026-02-10 12:50:54 |
| Message-ID: | CAKZiRmx-ycn+TT3_n97K40aNf4Ug0V5ywi3wu9p7fFwkWO+udg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Feb 9, 2026 at 3:29 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-02-09 14:41:12 +0100, Jakub Wartak wrote:
> > I've thought that the potential main reason of the hit would be slow fork(),
> > so I had an idea why we fork() with majority of memory being shared_buffers
> > (BufferBlocks) that is not really used inside postmaster itself
> > (I mean it does not use it, only backends do use it). I've thought it could
> > be cool if we could just init the memory, leave just the fd from memfd_create
> > for s_b around (that is unmap() BufferBlocks from the postmaster thus lowering
> > its RSS/smaps footprint) and then on fork() the fork() would NOT have to copy
> > that big kernel VMA for shared_buffers. Instead (in theory - only the fd that
> > is the reference - thereby we could increase the scalability of the postmaster
> > (kernel would need to perform less work during fork()). Later on, the classic
> > backends on their side would mmap() the region back from the fd created earlier
> > (in postmaster) using memfd_create(2), but that would happen as part of many
> > backends (so workload would be spread across many CPUs).
>
> FWIW, when looking at this in the past there were two noteworthy things:
>
> 1) The main driver of slowness was *NOT* shared buffers, but all the libraries
> we link to. Particularly openssl makes things a *lot* slower, due to all
> the small mappings it creates. If you compare the fork speed of a postgres
> with minimal dependencies and one with all the dependencies, you'll see a
> huge difference.
>
> The reason that openssl is so bad is that it modifies data in all the
> copy-on-write mappings during process exit processing. See [1].
>
>
> 2) A lot of the slowness isn't actually from the fork overhead itself, but
> from fork competing with the processing during process exit, as both taking
> conflicting locks.

Interesting, thanks for sharing this. I've studied fork() itself a little bit
more (fork() vs. various factors, without crazy exit() handlers). See the
attached results from 2 machines, or just run the fork_bench C proggie. My
conclusions on 6.14.x are as follows (these are mostly notes for myself while
studying this, but I think I'll share them; maybe just one variable is missing
here: how fork() ends up being affected by NUMA - future TODO for me):

MAP_SHARED (findings for this $thread)
--------------------------------------
a) In the "mmap-MAP_SHARED" cases, the max number of fork()/s drops, but only
very slightly, as the number of (still only MAP_SHARED!) segments increases.
This applies both with and without huge pages. "memfd_normal" seems to behave
almost identically, so at least from that angle the patch seems to be OK
(assuming it has just two segments today; yesterday it had 6 for me ;))

b) My wild trick/assumption - not related to $thread - that I posted earlier
under "memfd_unmap", assuming it would double postmaster scalability, has
fizzled: as you say, unmapping the segments before fork()ing and keeping just
the memfd, so that the child can restore the MAP_SHARED mapping from it, for
some reason degrades performance compared to just letting the mappings persist
or using MADV_DONTNEED. Probably it's page faulting as you say; I haven't
measured that. RIP idea.

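For illustration only (this is not the attached fork_bench6.c): a minimal sketch of the "memfd_unmap" pattern described in b). The parent creates a memfd, initializes it through a MAP_SHARED mapping, unmaps it before fork(), and the child maps it back from the inherited fd. The function name, the segment name "s_b_sketch" and the marker string are hypothetical; it assumes Linux with glibc 2.27 or newer for memfd_create(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if a child that re-maps the memfd after
 * fork() sees the data written by the parent, -1 on any failure.
 */
int memfd_unmap_roundtrip(size_t size)
{
    const char msg[] = "initialized by parent";
    assert(size >= sizeof(msg));

    int fd = memfd_create("s_b_sketch", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        return -1;
    }

    /* Parent: map once, initialize the shared memory. */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(p, msg, sizeof(msg));

    /* Drop the mapping before fork(); only the fd is inherited. */
    if (munmap(p, size) != 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: restore the mapping from the inherited fd. */
        char *c = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (c == MAP_FAILED)
            _exit(2);
        _exit(memcmp(c, msg, sizeof(msg)) == 0 ? 0 : 1);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    close(fd);
    if (waited != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling memfd_unmap_roundtrip(1 << 20) exercises the pattern with a 1 MB segment. It only demonstrates that the mapping can be restored in the child; it says nothing about the extra mmap() and page-fault cost the child pays, which is the cost the numbers above count against this idea.
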
MAP_PRIVATE (this can be ignored for the purposes of this $thread)
------------------------------------------------------------------
Nevertheless, it is quite interesting to see how those two modes compare, and
it touches on the aspect of openssl and e.g. io_uring creating many VMAs too
[1].
c) MAP_PRIVATE seems to be way slower because fork() must copy PTEs and
mark them as CoW. Performance drops as the total memory (number of pages)
increases. We should not have big MAP_ANONYMOUS|MAP_PRIVATE segments (or
even just many segments [1]) in the postmaster if we want fast fork().
But even when there is a lot of MAP_PRIVATE memory (in some edge case? large
heap?), fork() really benefits from huge pages there.

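For illustration only (again, not the attached benchmark): a rough sketch of how one could compare fork() rates with a populated MAP_SHARED versus MAP_PRIVATE anonymous region. The function name and parameters are hypothetical, and the loop times fork() plus child exit plus waitpid(), so it is only a coarse proxy for the fork() cost itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: maps `size` bytes of anonymous memory with the given
 * sharing flag (MAP_SHARED or MAP_PRIVATE), touches every page so that page
 * table entries exist, then measures completed fork()+wait cycles per second.
 * Returns -1.0 on failure.
 */
double forks_per_second(int share_flag, size_t size, int iterations)
{
    assert(share_flag == MAP_SHARED || share_flag == MAP_PRIVATE);

    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   share_flag | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1.0;

    /* Populate the page tables; untouched pages have no PTEs to copy. */
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < size; off += page)
        p[off] = 1;

    struct timespec t0, t1;
    int done = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);           /* child does nothing */
        if (pid < 0)
            break;
        if (waitpid(pid, NULL, 0) == pid)
            done++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(p, size);

    double secs = (double) (t1.tv_sec - t0.tv_sec)
                + (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? done / secs : -1.0;
}
```

Comparing forks_per_second(MAP_SHARED, size, n) against forks_per_second(MAP_PRIVATE, size, n) for growing sizes should show the trend described in c), though absolute numbers will depend on the machine, kernel and huge page settings.
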
My takeaway from this - and it's unrelated to this $thread, but still an
interesting finding for the future - is that once we have multithreading, we
might not be able to fork() efficiently from there (or MAP_PRIVATE/a big heap
will have a huge impact for all threads). It will clearly depend on the
architecture, but if the postmaster is removed and one giant PID has multiple
TIDs and somebody wants to run COPY TO/FROM PROGRAM often from there, we are
screwed unless those segments are marked MADV_DONTFORK.

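For illustration only: a minimal sketch of what marking a segment with the madvise(2) flag MADV_DONTFORK does. A region marked this way is not inherited by the child, so fork() has nothing to copy for it. The helper name is hypothetical; the child uses mincore(2), which fails with ENOMEM on unmapped ranges, to check whether the region exists in its address space.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a private anonymous region is still
 * mapped in the child after fork(), 0 if it is absent, -1 on error.
 */
int region_visible_in_child(int use_dontfork)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t size = 16 * page;

    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 1;                   /* make sure at least one page is populated */

    if (use_dontfork && madvise(p, size, MADV_DONTFORK) != 0) {
        munmap(p, size);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, size);
        return -1;
    }
    if (pid == 0) {
        unsigned char vec;

        /* mincore() succeeds for mapped ranges, ENOMEM for unmapped ones. */
        if (mincore(p, page, &vec) == 0)
            _exit(0);
        _exit(errno == ENOMEM ? 1 : 2);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(p, size);
    if (waited != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    if (WEXITSTATUS(status) == 1)
        return 0;
    return -1;
}
```

region_visible_in_child(0) reports the region as inherited, while region_visible_in_child(1) reports it as absent in the child. Any child code that touched such a region would fault, so this only suits memory the forked program never needs.
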
> I seriously doubt it's a good idea to delay the mmapping until after the fork,
> that'll just lead to more different mappings to exist that then all need to be
> tracked separately by the kernel.
Right, the raw numbers are not showing this as a good idea.
-J.
| Attachment | Content-Type | Size |
|---|---|---|
| laptop_hugepages.txt | text/plain | 4.4 KB |
| 1s4c4t__hugepages.txt | text/plain | 4.4 KB |
| 1s4c4t__nohugepages.txt | text/plain | 4.4 KB |
| laptop_nohugepages.txt | text/plain | 4.4 KB |
| fork_bench6.c | text/x-csrc | 4.7 KB |