Re: Better shared data structure management and resizable shared data structures

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, chaturvedipalak1911(at)gmail(dot)com
Subject: Re: Better shared data structure management and resizable shared data structures
Date: 2026-02-18 15:47:07
Message-ID: CAExHW5vz+PUHHUuzGRwtyx-mPLQk3nCZXxrFqnruRadEFrO5Xg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
> > because otherwise you'll get a SIGBUS when accessing the memory if there is no
> > huge page available anymore.
> >
>
> Ok.
>
> Jakub's experiments [1] showed that fallocate()ing shared memory would
> slow down postmaster start on a slow machine. I suppose the same thing
> applies to MADV_POPULATE_WRITE. And we don't do that today even in the
> case of huge pages; so we already have that problem.
>
> If we perform MADV_POPULATE_WRITE, do we want it only for resizable
> shared memory structures or all the structures in the shared memory?

In the attached patches, I have used MADV_POPULATE_WRITE only during
resizing, which is a run-time operation. Structures allocated at
server start are usually initialised right away, so their memory gets
allocated at that point anyway. So we don't need MADV_POPULATE_WRITE
at startup, and thus avoid adding to startup time, if any. Buffer
blocks are not initialised at server start, so their memory is
allocated as they are accessed. But that's how it works today, so no
change there.
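
To make the resize-time populate step concrete, here is a minimal sketch.
It is not code from the attached patches: populate_range() is a
hypothetical helper, and the fallback path for kernels or headers without
MADV_POPULATE_WRITE is my own assumption about what one might do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the patches: back [addr, addr + len)
 * with physical memory right after a structure grows. Both addr and len
 * must be page aligned (with huge pages the unit would be the huge page
 * size, which this sketch does not check).
 */
int
populate_range(void *addr, size_t len)
{
    size_t      pagesz = (size_t) sysconf(_SC_PAGESIZE);

    assert((uintptr_t) addr % pagesz == 0 && len % pagesz == 0);

#ifdef MADV_POPULATE_WRITE
    /* Linux 5.14+: prefault writable pages, reporting failure via errno. */
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;              /* e.g. ENOMEM: no memory or huge pages left */
#endif

    /*
     * Assumed fallback when MADV_POPULATE_WRITE is unavailable: touch each
     * page by rewriting its first byte, which keeps the contents intact.
     * Unlike madvise(), this cannot report failure; it faults instead.
     */
    for (size_t off = 0; off < len; off += pagesz)
    {
        volatile unsigned char *p = (volatile unsigned char *) addr + off;

        *p = *p;
    }
    return 0;
}
```

The point of preferring madvise() here is that a shortage of (huge) pages
shows up as an error return at resize time rather than as a SIGBUS on a
later access.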

>
>
> >
> > > 4. the address and length passed to madvise need to be page aligned,
> > > but that passed to fallocate() needn't be. `man fallocate` says
> > > "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> > > 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> > > range starting at offset and continuing for len bytes. Within the
> > > specified range, partial filesystem blocks are zeroed, and whole
> > > filesystem blocks are removed from the file.". It seems to be
> > > automatically taking care of the page size. So using fallocate()
> > > simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> > > filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> > > also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> > > guaranteed to be available on a system which supports MADV_REMOVE.
> >
> > I think it makes no sense to support resizing below page size
> > granularity. What's the point of doing that?
> >
>
> No point really. But we cannot control extensions that want to
> specify a maximum size smaller than a page size. They wouldn't know
> what page size the underlying machine will have, especially with huge
> pages which have a wide range of sizes. Even in the case of shared
> buffers, a value of max_shared_buffers may cause buffer blocks to span
> pages but other structures may fit a page.
>
> In the attached patches, if a resizable structure is such that its
> max_size is smaller than a page size, it is treated as a fixed
> structure with size = max_size. Any request to resize such structures
> will simply update the metadata without actual madvise operation. Only
> the structures whose max_size > page_size would be treated as truly
> resizable and will use madvise. You bring another interesting point.
> If a resizable structure has a maximum size higher than the page size,
> but it is allocated such that the initial part of it is on a partially
> allocated page and the last part of it is on another partially
> allocated page, those pages are never freed because of adjoining
> structures. Per the logic in the attached patches, all the fixed (or
> pseudo-resizable) structures are packed together. The resizable
> structures start on a page boundary and their max_sizes are adjusted
> to be page aligned. That way we can release pages when the structure
> shrinks more than a page.
>
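
The layout rule described above boils down to a bit of page arithmetic.
The following is only a sketch of that arithmetic under my reading of the
description; the names (PageRange, releasable_range, and so on) are
hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round x up to the next multiple of align (align > 0). */
size_t
align_up(size_t x, size_t align)
{
    return (x + align - 1) / align * align;
}

/*
 * A structure whose max_size does not exceed one page is treated as fixed:
 * a resize only updates metadata and never calls madvise().
 */
bool
is_truly_resizable(size_t max_size, size_t page_size)
{
    return max_size > page_size;
}

/* Byte range, relative to the start of the segment, made of whole pages. */
typedef struct PageRange
{
    size_t      offset;
    size_t      length;
} PageRange;

/*
 * Whole pages that can be released when a resizable structure shrinks from
 * old_size to new_size. Assumes the structure starts at a page-aligned
 * offset and owns every page up to its page-aligned max_size, so no
 * neighbouring structure shares these pages.
 */
PageRange
releasable_range(size_t start, size_t old_size, size_t new_size,
                 size_t page_size)
{
    PageRange   r = {0, 0};
    size_t      first_free;
    size_t      old_end;

    assert(start % page_size == 0 && new_size <= old_size);

    first_free = align_up(start + new_size, page_size);
    old_end = align_up(start + old_size, page_size);
    if (old_end > first_free)
    {
        r.offset = first_free;
        r.length = old_end - first_free;
    }
    return r;
}
```

For example, with 4096-byte pages, a structure at offset 8192 shrinking
from 20480 to 4196 bytes keeps its first two pages and can release the
12288 bytes starting at offset 16384. Shrinking by less than a page
releases nothing.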

> >
> > > Using fallocate() (or madvise()) to free memory, we don't need
> > > multiple segments. So much less code churn compared to the multiple
> > > mappings approach. However, there is one drawback. In the multiple
> > > mapping approach access beyond the current size of the structure would
> > > result in segfault or bus error. But in the fallocate/madvise approach
> > > such an access does not cause a crash. A write beyond the pages that
> > > fit the current size of the structure causes more memory to be
> > > allocated silently. A read returns 0s. So, there's a possibility that
> > > bugs in size calculations might go unnoticed. I think that's how it
> > > works even today, access in the yet un-allocated part of the shared
> > > memory will simply go unnoticed.
> >
> > If that's something you care about, you can mprotect(PROT_NONE) the relevant
> > regions.
>
> I am fine, if we let go of this protection while getting rid of
> multiple segments, if we all agree to do so.
>
> I could be wrong, but mprotect needs to be executed in every backend
> where the memory is mapped, and then a new backend needs to inherit it
> from the postmaster. That makes resizing complex, since it has to touch
> every backend. So avoiding mprotect is better.
>
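
To illustrate the "no crash, reads return 0s" behaviour described above,
here is a small standalone sketch. It is not from the patches: the
function name is hypothetical, and it uses an anonymous shared mapping and
MADV_REMOVE purely for demonstration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map four shared pages, fill them, then release the last two with
 * MADV_REMOVE as a "shrink" would. Returns 1 if the kept pages retain
 * their contents and the released pages read back as zeros, 0 if not,
 * and -1 if mmap() or madvise() is not usable on this system.
 */
int
released_pages_read_as_zero(void)
{
    size_t      pagesz = (size_t) sysconf(_SC_PAGESIZE);
    size_t      len = 4 * pagesz;
    unsigned char *base;
    int         result;

    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    memset(base, 0xAB, len);

    /* Release the backing store of pages 2 and 3; the mapping stays. */
    if (madvise(base + 2 * pagesz, 2 * pagesz, MADV_REMOVE) != 0)
    {
        munmap(base, len);
        return -1;
    }

    /*
     * Accessing the released range does not fault: the kernel silently
     * allocates fresh zero-filled pages on access.
     */
    result = (base[pagesz] == 0xAB &&
              base[2 * pagesz] == 0 &&
              base[len - 1] == 0);

    munmap(base, len);
    return result;
}
```

A stray write into the released range would likewise succeed and allocate
memory again, which is why a size-calculation bug can go unnoticed without
something like mprotect(PROT_NONE).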

If the general approach in the attached patches looks good, we can
work on making 0001 + 0002 committable and then work on 0003.

--
Best Wishes,
Ashutosh Bapat

Attachment                                                       Content-Type  Size
0001-wip-Introduce-a-new-way-of-registering-shar-20260218.patch  text/x-patch  53.8 KB
0002-Get-rid-of-global-shared-memory-pointer-mac-20260218.patch  text/x-patch  15.0 KB
0003-WIP-resizable-shared-memory-structures-20260218.patch       text/x-patch  39.8 KB
