| From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, chaturvedipalak1911(at)gmail(dot)com, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Changing shared_buffers without restart |
| Date: | 2026-02-10 06:17:19 |
| Message-ID: | CAExHW5vEfDQuqgV0Z_8=5htZTt186VioD+d2YtszywegAag5=Q@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Feb 9, 2026 at 7:11 PM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
>
> On Wed, Jan 28, 2026 at 2:19 PM Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> >v 20260128*.patch
>
> Short intro: I've started trying out these patches for a slightly different
> reason than online buffer resizing. There was a recent post [1] that was
> brought to our attention by Alvaro. That article complains about the
> postmaster being unscalable, more or less saturating at 2-3k new connections
> per second, with the postmaster becoming a CPU hog (one could argue that
> such a setup is excessive and not sensible).
>
> I thought the main reason for the hit would be slow fork(), which led to an
> idea: why do we fork() with the majority of memory being shared_buffers
> (BufferBlocks), when the postmaster itself does not really use it (only the
> backends do)? I thought it could be cool if we could just init the memory
> and keep only the fd from memfd_create() for s_b around (that is, munmap()
> BufferBlocks from the postmaster, lowering its RSS/smaps footprint), so that
> fork() would NOT have to copy that big kernel VMA for shared_buffers - in
> theory only the fd that references it - thereby increasing the scalability
> of the postmaster (the kernel would need to perform less work during
> fork()). Later on, the classic backends would mmap() the region back from
> the fd created earlier (in the postmaster) using memfd_create(2), but that
> would happen in many backends (so the workload would be spread across many
> CPUs). The critical assumption here is that although on Linux there seems
> to be huge-page PMD sharing for MAP_SHARED | MAP_HUGETLB, I was still
> wondering whether we couldn't accelerate things further by simply not having
> this memory mapped at all before calling fork(). Initially, a simple PoC
> bench on 64GB, even with hugepages, showed some potential:
> Scenario 1 (mmap inherited): 20001 total forks, 0.302ms per fork
> Scenario 2 (MADV_DONTFORK): 20001 total forks, 0.292ms per fork
> Scenario 3 (memfd_create): 20002 total forks, 0.145ms per fork
>
> Quite unexpectedly, that's how I discovered your and Dmitry's patchset, as
> it already had the memory separated into segments (rather than one big
> mmap() blob) and used memfd_create(2) too, so I just gave it a try. So I
> benchmarked your patchset when it comes to establishing new connections:
>
> 1s4c 32GB RAM, 6.14.x kernel, 16GB shared_buffers
> benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 4 -c 100
> -f <(echo "SELECT 1;") postgres -P 1 -T 30
>
> # master
> latency average = 358.681 ms
> latency stddev = 225.813 ms
> average connection time = 2.989 ms
> tps = 1329.733460 (including reconnection times)
>
> # memfd/thispatchset
> latency average = 363.584 ms
> latency stddev = 230.529 ms
> average connection time = 3.022 ms
> tps = 1315.810761 (including reconnection times)
>
> # memfd+mytrick, showed some promise in low stddev, but not in TPS
> latency average = 34.229 ms
> latency stddev = 22.059 ms
> average connection time = 2.908 ms
> tps = 1369.785773 (including reconnection times)
>
> Another box, 4s32c64, 128GB RAM, 6.14.x kernel,
> 64GB shared_buffers (4 NUMA nodes)
>
> benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 128 -c 1000
> -f <(echo "SELECT 1;") postgres -P 1 -T 30
>
> #master
> latency average = 240.179 ms
> latency stddev = 119.379 ms
> average connection time = 62.049 ms
> tps = 2058.434343 (including reconnection times)
>
> #memfd
> latency average = 268.384 ms
> latency stddev = 133.501 ms
> average connection time = 69.081 ms
> tps = 1847.422995 (including reconnection times)
>
> #memfd+mytrick
> latency average = 261.726 ms
> latency stddev = 130.161 ms
> average connection time = 67.579 ms
> tps = 1889.988400 (including reconnection times)
>
Thanks for the benchmarks. I can see:
1. There isn't much impact of having multiple segments on new connection time.
2. fallocate seems to be behind the regression on the machine with 4 NUMA nodes.
Am I reading that correctly?
The latest patches 20260209 use only two segments. Please check if
that improves the situation further.
> So:
> a) yes, my idea fizzled - still no crystal clear idea why - but at least
> I've tried your patch :) We are still in the ballpark of ~1800..3000
> new connections per second.
>
> and here proper review against patchset follows:
> b) the patch changes the behavior on startup: it now appears to touch all
> the memory during startup, which takes much more time (I'm thinking of HA
> failover/promote scenarios where a long startup could mean trouble, e.g.
> after pg_rewind). E.g. without the patch startup takes 1-2s, with the
> patch it takes 49s (no HugePages, 64GB s_b, on a slow machine). This
> happens due to the new fallocate() call in shmem_fallocate(). If it is
> supposed to stay like that, IMHO the log should elog() what it is doing
> ("allocating memory..."), otherwise users can be left confused. It almost
> behaves as if MAP_POPULATE were used.
>
> c) as per the above measurements, on NUMA there appears to be a regression
> to ~89% of baseline (1847/2058) when it comes to establishing new
> connections, and you are operating on sysv_shmem.c (so affecting all
> users). Possibly this would have to be re-tested on more modern hardware
> (I don't see it on a single socket, but I do see it on multiple sockets).
I have added a TODO in the code to investigate this case later as we
fine tune the code.
>
> d) MADV_HUGEPAGE is Linux 4.14+, and although that was released nearly 10
> years ago, the buildfarm probably still has some animals (Ubuntu 16?)
> running such old kernels (?)
>
> e) so maybe because of b+c+d we should consider putting it under some new
> shared_memory_type in the long run?
That may be a good idea, so as to avoid hitting a segfault at run time
because of a lack of memory backing the shared memory.
>
> f) With huge_pages=on and no asserts it never worked for me, due to:
> FATAL: segment[main]: could not truncate anonymous file to
> size 313483264: Invalid argument
> Please see this (this is with both(!)
> max_shared_buffers=shared_buffers=1GB);
> for some reason ftruncate() ended up being called with ~2x the size:
> [pid 1252287] memfd_create("main", MFD_HUGETLB) = 4
> [pid 1252287] mmap(NULL, 157286400, PROT_NONE, MAP_SHARED|MAP_NORESE..
> [pid 1252287] mprotect(0x7f2a1a400000, 157286400, PROT_READ|PROT_WRI..
> [pid 1252287] ftruncate(4, 313483264) = -1 EINVAL (Invalid argument)
> It appears that I'm getting this due to a bug in
> round_off_mapping_sizes_for_hugepages(): before it is called I'm getting
> shmem_reserved=156196864, shmem_req_size=156196864
> and after it is called it returns
> shmem_reserved=157286400, shmem_req_size=313483264
> Maybe TYPEALIGN() would be a better fit there.
>
I see the bug. Fixed in the attached diff. Please apply it on top of
20260209 and let me know if it fixes the issue for you. I will include
it in the next set of patches.
--
Best Wishes,
Ashutosh Bapat
| Attachment | Content-Type | Size |
|---|---|---|
| huge_page_fix.diff.no_ci | application/octet-stream | 3.5 KB |