| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, chaturvedipalak1911(at)gmail(dot)com |
| Subject: | Re: Better shared data structure management and resizable shared data structures |
| Date: | 2026-02-16 17:56:03 |
| Message-ID: | mlsruptoxgm2nqtdfyfsowjklzxl5zltsjb3y5bmywtigm474l@5tsonk4t3kia |
| Lists: | pgsql-hackers |
Hi,
On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
> On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> > On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > > `man madvise` has this
> > > MADV_REMOVE (since Linux 2.6.16)
> > > Free up a given range of pages and its associated
> > > backing store. This is equivalent to punching a
> > > hole in the corresponding byte range of the backing
> > > store (see fallocate(2)). Subsequent accesses
> > > in the specified address range will see bytes containing zero.
> > >
> > > The specified address range must be mapped shared
> > > and writable. This flag cannot be applied to
> > > locked pages, Huge TLB pages, or VM_PFNMAP pages.
> > >
> > > In the initial implementation, only tmpfs(5) was
> > > supported MADV_REMOVE; but since Linux 3.5, any
> > > filesystem which supports the fallocate(2)
> > > FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > > Hugetlbfs fails with the error EINVAL and other
> > > filesystems fail with the error EOPNOTSUPP.
> > >
> > > It says the flag can not be applied to Huge TLB pages. We won't be
> > > able to make resizable shared memory structures allocated with huge
> > > pages. That seems like a serious restriction.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html:
> >
> > MADV_REMOVE (since Linux 2.6.16)
> > ...
> >
> > Support for the Huge TLB filesystem was added in Linux
> > v4.3.
> >
> > > I may be misunderstanding something, but it seems like this is useful
> > > to free already allocated memory, not necessarily allocate more
> > > memory. I don't understand how a user would start with a larger
> > > reserved address space with only small portions of that space being
> > > backed by memory.
> >
> > Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
> > to reserve address space for the maximum size, and then
> > madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> > madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> > again.
>
> Thank you for the hint. Also thanks to Andres's idea, the resizable
> structure patch is quite small now. Actually, after experimenting with
> madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
> is not required at all. We don't have to do anything to expand a
> structure. Memory will be allocated as and when the program writes to
> it.
I think we *do* want MADV_POPULATE_WRITE, at least when using huge pages.
Otherwise, if no huge page is available anymore, you'll get a SIGBUS when the
memory is first accessed.
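
To illustrate the point about populating eagerly, here is a minimal sketch of the reserve-then-grow pattern discussed above. It is not code from any patch: `reserve_region` and `grow_region` are hypothetical names, the mapping is assumed to be anonymous and shared, and all sizes are assumed to be multiples of the page size (the huge page size, if huge pages are used). MADV_POPULATE_WRITE needs Linux 5.14 or later; older kernels reject it with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Older libc headers may lack the constant; 23 is the Linux UAPI value. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: reserve address space for max_size bytes without
 * committing memory for it. Returns MAP_FAILED on error.
 */
static void *
reserve_region(size_t max_size)
{
    return mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/*
 * Hypothetical helper: back [base + old_size, base + new_size) with memory
 * right away, so that a shortage is reported here through errno instead of
 * showing up later as SIGBUS on first access.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
grow_region(void *base, size_t old_size, size_t new_size)
{
    assert(new_size >= old_size);

    if (new_size == old_size)
        return 0;               /* nothing to populate */

    return madvise((char *) base + old_size, new_size - old_size,
                   MADV_POPULATE_WRITE);
}
```

A caller would check the return value of `grow_region()` and report the failure as an ordinary error, which is the behavior a lazily populated mapping cannot offer.
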
> I also discovered things that I didn't know about.
> 1. ftruncate() sets the size of the file but it doesn't allocate the
> memory pages.
Right.
> 2. to use madvise() the address needs to be backed by a file, so
> memfd_create is a must.
I am quite sure that is not true. I hacked this up with today's postgres, and
madvise() works with the mmap()-backed allocation from sysv_shmem.c, which is
anonymous.
What made you conclude that a file is required?
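
A small standalone check along these lines can be sketched as follows. It is only an illustration, not the experiment described above: `anon_madv_remove_zeroes` is a hypothetical name, and the mapping is a plain anonymous MAP_SHARED mapping with no memfd or other file descriptor involved. The length is assumed to be a multiple of the page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical check: map len bytes of anonymous shared memory, dirty it,
 * release it with MADV_REMOVE, and look at what a later read sees.
 *
 * Returns  1 if the range reads back as zeros after MADV_REMOVE,
 *          0 if the old contents survived,
 *         -1 if mmap() or madvise() failed (errno is set).
 */
static int
anon_madv_remove_zeroes(size_t len)
{
    unsigned char *p;
    int         result;

    assert(len > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Touch every page so that there is something to free. */
    memset(p, 0xAB, len);

    if (madvise(p, len, MADV_REMOVE) != 0)
    {
        int         saved_errno = errno;

        munmap(p, len);
        errno = saved_errno;
        return -1;
    }

    /* Reading after the removal faults in fresh zero pages. */
    result = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return result;
}
```
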
> 4. the address and length passed to madvise needs to be page aligned,
> but that passed to fallocate() needn't be. `man fallocate` says
> "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> range starting at offset and continuing for len bytes. Within the
> specified range, partial filesystem blocks are zeroed, and whole
> filesystem blocks are removed from the file.". It seems to be
> automatically taking care of the page size. So using fallocate()
> simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> guaranteed to be available on a system which supports MADV_REMOVE.
I think it makes no sense to support resizing below page size
granularity. What's the point of doing that?
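
For illustration, page-granular resizing comes down to a little rounding arithmetic. The sketch below uses hypothetical helper names and assumes that the page size passed in is a power of two (the regular page size, or the huge page size when huge pages are in use) and that the structure starts on a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Regular page size of the running system. */
static size_t
system_page_size(void)
{
    return (size_t) sysconf(_SC_PAGESIZE);
}

/* Round n up to the next multiple of page (page must be a power of two). */
static size_t
align_up(size_t n, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (n + page - 1) & ~(page - 1);
}

/* Round n down to the previous multiple of page. */
static size_t
align_down(size_t n, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return n & ~(page - 1);
}

/*
 * Hypothetical helper: when a structure shrinks from old_size to new_size
 * bytes, compute the whole pages that no longer hold any live data.
 * *offset receives the start of that range relative to the structure's
 * base, and the return value is its length (0 if no whole page is freed).
 */
static size_t
releasable_range(size_t old_size, size_t new_size, size_t page,
                 size_t *offset)
{
    size_t      start = align_up(new_size, page);
    size_t      end = align_up(old_size, page);

    assert(new_size <= old_size);
    *offset = start;
    return end - start;
}
```

With 4096-byte pages, shrinking from 10000 to 5000 bytes frees only the third page: the second page still holds bytes 4096 through 4999 and has to stay.
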
> Using fallocate() (or madvise()) to free memory, we don't need
> multiple segments. So much less code churn compared to the multiple
> mappings approach. However, there is one drawback. In the multiple
> mapping approach access beyond the current size of the structure would
> result in segfault or bus error. But in the fallocate/madvise approach
> such an access does not cause a crash. A write beyond the pages that
> fit the current size of the structure causes more memory to be
> allocated silently. A read returns 0s. So, there's a possibility that
> bugs in size calculations might go unnoticed. I think that's how it
> works even today, access in the yet un-allocated part of the shared
> memory will simply go unnoticed.
If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.
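
A rough sketch of that idea, with hypothetical helper names and page-aligned sizes assumed: everything past the current size is made inaccessible, so a stray access faults instead of silently allocating memory. Note that mprotect() changes only the calling process's view of the mapping, so this sketch leaves open how other processes sharing the memory would be kept in sync.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: make [base, base + used) readable and writable and
 * [base + used, base + reserved) inaccessible. Call it again with a
 * different used value after growing or shrinking.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
set_usable_size(void *base, size_t used, size_t reserved)
{
    assert(used <= reserved);

    if (used > 0 &&
        mprotect(base, used, PROT_READ | PROT_WRITE) != 0)
        return -1;

    if (used < reserved &&
        mprotect((char *) base + used, reserved - used, PROT_NONE) != 0)
        return -1;

    return 0;
}

/*
 * Demonstration helper: write one byte at addr in a forked child.
 *
 * Returns 1 if the child was killed by a signal (the access faulted),
 *         0 if the write succeeded,
 *        -1 if fork() or waitpid() failed.
 */
static int
write_faults(volatile char *addr)
{
    pid_t       pid = fork();
    int         status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        *addr = 1;              /* faults if the page is PROT_NONE */
        _exit(0);
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? 1 : 0;
}
```

After `set_usable_size(base, used, reserved)`, a write at `base + used - 1` goes through while a write at `base + used` kills the child, which makes size-calculation bugs visible.
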
Greetings,
Andres Freund