Re: Better shared data structure management and resizable shared data structures

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, chaturvedipalak1911(at)gmail(dot)com
Subject: Re: Better shared data structure management and resizable shared data structures
Date: 2026-02-17 11:36:24
Message-ID: CAExHW5uEK+eeG7e2g6uWh7POrFpfp+dqfaa=_3miMN17zgeaJw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
> because otherwise you'll get a SIGBUS when accessing the memory if there is no
> huge page available anymore.
>

Ok.

Jakub's experiments [1] showed that fallocate()ing shared memory slows
down postmaster start on a slow machine. I suppose the same applies to
MADV_POPULATE_WRITE. We don't populate the memory today even in the
case of huge pages, so the SIGBUS problem you describe already exists.

If we perform MADV_POPULATE_WRITE, do we want it only for resizable
shared memory structures or all the structures in the shared memory?
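For reference, a minimal sketch of the call under discussion. It is not taken from the patches: map_and_populate is a hypothetical helper, and it uses a plain anonymous shared mapping without MAP_HUGETLB so that it runs without reserved huge pages. MADV_POPULATE_WRITE needs Linux 5.14 or later; on older kernels or headers the sketch just reports that nothing was prefaulted. The idea is that a populate failure shows up as an error return at startup instead of a SIGBUS on first access.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patches): map `size` bytes of anonymous
 * shared memory and try to prefault all of it.
 *
 * Returns the mapping, or NULL if mmap() failed. *populated is set to 1 only
 * if the kernel accepted MADV_POPULATE_WRITE for the whole range.
 */
static void *
map_and_populate(size_t size, int *populated)
{
    void *addr;

    assert(populated != NULL);
    *populated = 0;

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        return NULL;

#ifdef MADV_POPULATE_WRITE
    /* Fails (e.g. with EINVAL) on kernels that predate the flag. */
    if (madvise(addr, size, MADV_POPULATE_WRITE) == 0)
        *populated = 1;
#endif
    return addr;
}
```

Whether to call this for the whole segment or only for the resizable structures is exactly the open question above; the sketch takes no position on it.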

On Mon, Feb 16, 2026 at 11:02 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 16/02/2026 16:52, Ashutosh Bapat wrote:
> > 2. to use madvise() the address needs to be backed by a file, so
> > memfd_create is a must.
>
> It seems to work fine for anonymous mmapped memory here. See attached
> test program.

On Mon, Feb 16, 2026 at 11:26 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > 2. to use madvise() the address needs to be backed by a file, so
> > memfd_create is a must.
>
> I am quite sure that that is not true. I hacked this up with today's
> postgres, and the madvise works with the mmap() backed allocation from
> sysv_shmem.c, which is anonymous.
>
> What made you conclude that that is the case?
>

You are right. I was misled by the following sentence in `man
madvise`: "but since Linux 3.5, any filesystem which supports the
fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
Filesystems which do not support MADV_REMOVE fail with the error
EOPNOTSUPP." Then, in a subsequent experiment, I dropped MAP_ANONYMOUS
from mmap() and called madvise(), which of course didn't work. My bad.

In the attached patches, I have removed memfd_create, which simplifies the code.
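A minimal sketch of what works without memfd_create, assuming Linux and an anonymous shared mapping like the one Andres mentions from sysv_shmem.c. The helper names map_shared_anon and release_pages are hypothetical and not from the patches.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous shared mapping, no fd involved. */
static char *
map_shared_anon(size_t size)
{
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return addr == MAP_FAILED ? NULL : (char *) addr;
}

/*
 * Hypothetical helper: give the pages backing [addr, addr + len) back to
 * the kernel. Both addr and len must be multiples of the page size.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int
release_pages(char *addr, size_t len)
{
    size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);

    assert((uintptr_t) addr % pagesz == 0 && len % pagesz == 0);
    return madvise(addr, len, MADV_REMOVE);
}
```

After a successful release_pages(), reads from the released range return zeros and writes silently allocate fresh pages, which is the behaviour discussed further down.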

>
> > 4. the address and length passed to madvise needs to be page aligned,
> > but that passed to fallocate() needn't be. `man fallocate` says
> > "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> > 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> > range starting at offset and continuing for len bytes. Within the
> > specified range, partial filesystem blocks are zeroed, and whole
> > filesystem blocks are removed from the file.". It seems to be
> > automatically taking care of the page size. So using fallocate()
> > simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> > filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> > also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> > guaranteed to be available on a system which supports MADV_REMOVE.
>
> I think it makes no sense to support resizing below page size
> granularity. What's the point of doing that?
>

No point really. But we cannot prevent extensions from specifying a
maximum size smaller than the page size. They can't know what page
size the underlying machine will have, especially with huge pages,
which come in a wide range of sizes. Even in the case of shared
buffers, a given value of max_shared_buffers may cause the buffer
blocks to span pages while other structures fit within a page.

In the attached patches, a resizable structure whose max_size is
smaller than the page size is treated as a fixed structure with
size = max_size. A request to resize such a structure simply updates
the metadata without any madvise call. Only structures whose
max_size > page_size are treated as truly resizable and use madvise.

You bring up another interesting point. If a resizable structure has a
maximum size larger than the page size, but is allocated such that its
first part lies on a partially allocated page and its last part on
another partially allocated page, those pages are never freed because
of the adjoining structures. In the attached patches, all the fixed
(and pseudo-resizable) structures are packed together. The resizable
structures start on a page boundary and their max_sizes are adjusted
to a multiple of the page size. That way we can release pages when a
structure shrinks by more than a page.
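The arithmetic behind this can be sketched as follows. This is an illustration only: the function names are hypothetical, it assumes max_size is rounded up to the next page boundary, and it assumes the structure starts on a page boundary as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round value up to the next multiple of page_size. */
static size_t
align_up(size_t value, size_t page_size)
{
    return (value + page_size - 1) / page_size * page_size;
}

/* Structures no larger than a page are treated as fixed-size. */
static bool
is_truly_resizable(size_t max_size, size_t page_size)
{
    return max_size > page_size;
}

/*
 * Bytes of whole pages that become releasable when a page-aligned
 * structure shrinks from old_size to new_size. The page holding the
 * last byte of the new size stays allocated.
 */
static size_t
releasable_bytes(size_t old_size, size_t new_size, size_t page_size)
{
    assert(new_size <= old_size);
    return align_up(old_size, page_size) - align_up(new_size, page_size);
}
```

With 4096-byte pages, shrinking from 12288 to 4097 bytes releases one page, while shrinking from 8192 to 8000 bytes releases nothing because both sizes still need the second page.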

>
> > Using fallocate() (or madvise()) to free memory, we don't need
> > multiple segments. So much less code churn compared to the multiple
> > mappings approach. However, there is one drawback. In the multiple
> > mapping approach access beyond the current size of the structure would
> > result in segfault or bus error. But in the fallocate/madvise approach
> > such an access does not cause a crash. A write beyond the pages that
> > fit the current size of the structure causes more memory to be
> > allocated silently. A read returns 0s. So, there's a possibility that
> > bugs in size calculations might go unnoticed. I think that's how it
> > works even today, access in the yet un-allocated part of the shared
> > memory will simply go unnoticed.
>
> If that's something you care about, you can mprotect(PROT_NONE) the relevant
> regions.

I am fine with giving up this protection in exchange for getting rid
of multiple segments, if we all agree to do so.

I could be wrong, but mprotect needs to be executed in every backend
where the memory is mapped, and a new backend then needs to inherit it
from the postmaster. That makes resizing complex, since it has to
touch every backend. So avoiding mprotect is better.
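A small sketch of why every process would need its own call, assuming Linux and fork() as a stand-in for the postmaster/backend relationship. The function name is hypothetical. mprotect() changes protections only in the calling process's address space, so a child marking the shared region PROT_NONE leaves the parent's view writable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the parent can still write to a shared anonymous mapping
 * after a child made its own view PROT_NONE, -1 on any setup failure.
 */
static int
parent_unaffected_by_child_mprotect(void)
{
    long  pagesz = sysconf(_SC_PAGESIZE);
    char *p;
    pid_t pid;
    int   status;

    assert(pagesz > 0);
    p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)               /* child: protect its own view and exit */
        _exit(mprotect(p, (size_t) pagesz, PROT_NONE) == 0 ? 0 : 1);

    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    p[0] = 42;                  /* would fault if the child's change applied here */
    return p[0] == 42;
}
```

The reverse direction does hold at fork time: a child created after the parent's mprotect() starts with the parent's protections, which is the inheritance from the postmaster mentioned above.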

[1] https://www.postgresql.org/message-id/CAKZiRmwxVqEbp7JgOed%3DBCT6cq8RNuHk3N0vuwro65Tsw9E8NA%40mail.gmail.com

PFA patches.

--
Best Wishes,
Ashutosh Bapat

Attachment Content-Type Size
0002-Get-rid-of-global-shared-memory-pointer-mac-20260217.patch text/x-patch 15.0 KB
0001-wip-Introduce-a-new-way-of-registering-shar-20260217.patch text/x-patch 53.8 KB
0003-WIP-resizable-shared-memory-structures-20260217.patch text/x-patch 40.3 KB
