Re: Better shared data structure management and resizable shared data structures

From: Andres Freund <andres(at)anarazel(dot)de>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, chaturvedipalak1911(at)gmail(dot)com
Subject: Re: Better shared data structure management and resizable shared data structures
Date: 2026-02-16 17:56:03
Message-ID: mlsruptoxgm2nqtdfyfsowjklzxl5zltsjb3y5bmywtigm474l@5tsonk4t3kia
Lists: pgsql-hackers

Hi,

On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
> On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> > On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > > `man madvise` has this
> > > MADV_REMOVE (since Linux 2.6.16)
> > > Free up a given range of pages and its associated
> > > backing store. This is equivalent to punching a
> > > hole in the corresponding byte range of the backing
> > > store (see fallocate(2)). Subsequent accesses
> > > in the specified address range will see bytes containing zero.
> > >
> > > The specified address range must be mapped shared
> > > and writable. This flag cannot be applied to
> > > locked pages, Huge TLB pages, or VM_PFNMAP pages.
> > >
> > > In the initial implementation, only tmpfs(5)
> > > supported MADV_REMOVE; but since Linux 3.5, any
> > > filesystem which supports the fallocate(2)
> > > FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > > Hugetlbfs fails with the error EINVAL and other
> > > filesystems fail with the error EOPNOTSUPP.
> > >
> > > It says the flag can not be applied to Huge TLB pages. We won't be
> > > able to make resizable shared memory structures allocated with huge
> > > pages. That seems like a serious restriction.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html:
> >
> > MADV_REMOVE (since Linux 2.6.16)
> > ...
> >
> > Support for the Huge TLB filesystem was added in Linux
> > v4.3.
> >
> > > I may be misunderstanding something, but it seems like this is useful
> > > to free already allocated memory, not necessarily allocate more
> > > memory. I don't understand how a user would start with a larger
> > > reserved address space with only small portions of that space being
> > > backed by memory.
> >
> > Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
> > to reserve address space for the maximum size, and then
> > madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> > madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> > again.
>
> Thank you for the hint. Also thanks to Andres's idea, the resizable
> structure patch is quite small now. Actually, after experimenting with
> madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
> is not required at all. We don't have to do anything to expand a
> structure. Memory will be allocated as and when the program writes to
> it.

I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.
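
To illustrate the sequence under discussion, here is a minimal sketch. It
is not taken from any patch in this thread: reserve_region() and
grow_region() are hypothetical names, the fallback #define assumes the Linux
UAPI value for MADV_POPULATE_WRITE, and the mapping uses regular pages
(a huge page variant would presumably add MAP_HUGETLB to the mmap() flags).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may lack the constant; 23 is the Linux UAPI value. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: reserve address space for the maximum size without
 * reserving memory for it. Returns MAP_FAILED on error, like mmap().
 */
static void *
reserve_region(size_t max_size)
{
    return mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/*
 * Hypothetical helper: back [old_size, new_size) with memory right away.
 * Returns 0 on success and -1 with errno set on failure, so running out of
 * pages is reported here rather than at the first access. Both sizes must
 * be multiples of the page size.
 */
static int
grow_region(void *base, size_t old_size, size_t new_size)
{
    size_t      pagesz = (size_t) sysconf(_SC_PAGESIZE);

    assert(old_size % pagesz == 0 && new_size % pagesz == 0);

    if (new_size <= old_size)
        return 0;               /* nothing to populate */

    return madvise((char *) base + old_size, new_size - old_size,
                   MADV_POPULATE_WRITE);
}
```

A caller would reserve once at startup, call grow_region(base, 0,
initial_size), and treat a -1 from a later grow_region() as "resize failed"
instead of crashing. On kernels without MADV_POPULATE_WRITE the madvise()
call fails with an error, which the caller has to handle as well.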

> I also discovered things that I didn't know about.
> 1. ftruncate() sets the size of the file but it doesn't allocate the
> memory pages.

Right.

> 2. to use madvise() the address needs to be backed by a file, so
> memfd_create is a must.

I am quite sure that that is not true. I hacked this up with today's
postgres, and the madvise works with the mmap() backed allocation from
sysv_shmem.c, which is anonymous.

What made you conclude that that is the case?
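
For reference, a standalone check along these lines (a sketch only, not the
hack mentioned above; remove_on_anonymous_shared() is a made-up name) maps
one anonymous shared page, dirties it, and applies MADV_REMOVE without any
memfd_create() involved:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if the page reads back as zeroes after MADV_REMOVE, 0 if the
 * old contents survived, and -1 if mmap() or madvise() failed.
 */
static int
remove_on_anonymous_shared(void)
{
    size_t      pagesz = (size_t) sysconf(_SC_PAGESIZE);
    char       *p;
    int         zeroed;

    /* Shared + anonymous: no file descriptor is supplied by the caller. */
    p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Dirty the page so there is something to free. */
    memset(p, 0xAB, pagesz);
    assert((unsigned char) p[0] == 0xAB);

    if (madvise(p, pagesz, MADV_REMOVE) != 0)
    {
        munmap(p, pagesz);
        return -1;
    }

    /* After a successful MADV_REMOVE the range should read as zeroes. */
    zeroed = (p[0] == 0 && p[pagesz - 1] == 0);
    munmap(p, pagesz);
    return zeroed;
}
```

A return value of -1 means the call itself was rejected on that system, which
is a different outcome from the data surviving the call.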

> 4. the address and length passed to madvise need to be page aligned,
> but that passed to fallocate() needn't be. `man fallocate` says
> "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> range starting at offset and continuing for len bytes. Within the
> specified range, partial filesystem blocks are zeroed, and whole
> filesystem blocks are removed from the file.". It seems to be
> automatically taking care of the page size. So using fallocate()
> simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
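
One way to keep everything at page granularity is to round the logical size
of the structure up to a page multiple before deciding what can be released.
A small sketch, with hypothetical helper names and assuming the page size is
a power of two:

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of pagesz (a power of two). */
static size_t
round_up_to_page(size_t size, size_t pagesz)
{
    assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
    return (size + pagesz - 1) & ~(pagesz - 1);
}

/*
 * Number of bytes, always a whole number of pages, that can be released
 * when a structure shrinks from old_size to new_size. The page that still
 * holds the tail of the shrunken structure is kept.
 */
static size_t
releasable_bytes(size_t old_size, size_t new_size, size_t pagesz)
{
    size_t      old_end = round_up_to_page(old_size, pagesz);
    size_t      new_end = round_up_to_page(new_size, pagesz);

    return old_end > new_end ? old_end - new_end : 0;
}
```

With 4096-byte pages, shrinking from 10000 to 5000 bytes releases exactly one
page (offsets 8192 to 12287), and shrinking from 5000 to 4097 bytes releases
nothing.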

> Using fallocate() (or madvise()) to free memory, we don't need
> multiple segments. So much less code churn compared to the multiple
> mappings approach. However, there is one drawback. In the multiple
> mapping approach access beyond the current size of the structure would
> result in segfault or bus error. But in the fallocate/madvise approach
> such an access does not cause a crash. A write beyond the pages that
> fit the current size of the structure causes more memory to be
> allocated silently. A read returns 0s. So, there's a possibility that
> bugs in size calculations might go unnoticed. I think that's how it
> works even today, access in the yet un-allocated part of the shared
> memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.
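
A rough sketch of what that could look like follows. The helper names are
hypothetical, and the fork-based probe exists only to demonstrate the fault.
Note that mprotect() changes the protection in the calling process only, so
presumably every process with the mapping would have to apply it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: make [0, usable_size) accessible and turn the rest
 * of the reserved range into a no-access guard area. Sizes must be page
 * multiples. Returns 0 on success, -1 on failure.
 */
static int
set_usable_size(void *base, size_t max_size, size_t usable_size)
{
    size_t      pagesz = (size_t) sysconf(_SC_PAGESIZE);
    char       *start = base;

    assert(usable_size <= max_size);
    assert(usable_size % pagesz == 0 && max_size % pagesz == 0);

    if (usable_size > 0 &&
        mprotect(start, usable_size, PROT_READ | PROT_WRITE) != 0)
        return -1;
    if (usable_size < max_size &&
        mprotect(start + usable_size, max_size - usable_size, PROT_NONE) != 0)
        return -1;
    return 0;
}

/*
 * Demo probe: write one byte to addr in a child process. Returns 1 if the
 * child died with SIGSEGV, 0 if the write went through, -1 on error.
 */
static int
write_faults(volatile char *addr)
{
    int         status;
    pid_t       pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        *addr = 1;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

After set_usable_size(base, max, usable), a write at base + usable faults
instead of silently allocating memory, and calling set_usable_size() again
with a larger usable size makes that range accessible once more.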

Greetings,

Andres Freund
