| From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, chaturvedipalak1911(at)gmail(dot)com |
| Subject: | Re: Better shared data structure management and resizable shared data structures |
| Date: | 2026-02-16 14:52:51 |
| Message-ID: | CAExHW5s9Vp+-vJi020UJ+otyccBBo7eT1g6bttdRKL6HAvscyQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > `man madvise` has this
> > MADV_REMOVE (since Linux 2.6.16)
> > Free up a given range of pages and its associated
> > backing store. This is equivalent to punching a
> > hole in the corresponding byte range of the backing
> > store (see fallocate(2)). Subsequent accesses
> > in the specified address range will see bytes containing zero.
> >
> > The specified address range must be mapped shared
> > and writable. This flag cannot be applied to
> > locked pages, Huge TLB pages, or VM_PFNMAP pages.
> >
> > In the initial implementation, only tmpfs(5) was
> > supported MADV_REMOVE; but since Linux 3.5, any
> > filesystem which supports the fallocate(2)
> > FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > Hugetlbfs fails with the error EINVAL and other
> > filesystems fail with the error EOPNOTSUPP.
> >
> > It says the flag can not be applied to Huge TLB pages. We won't be
> > able to make resizable shared memory structures allocated with huge
> > pages. That seems like a serious restriction.
>
> Per https://man7.org/linux/man-pages/man2/madvise.2.html:
>
> MADV_REMOVE (since Linux 2.6.16)
> ...
>
> Support for the Huge TLB filesystem was added in Linux
> v4.3.
>
> > I may be misunderstanding something, but it seems like this is useful
> > to free already allocated memory, not necessarily allocate more
> > memory. I don't understand how a user would start with a larger
> > reserved address space with only small portions of that space being
> > backed by memory.
>
> Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call
> to reserve address space for the maximum size, and then
> madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> again.

Thank you for the hint. Also, thanks to Andres's idea, the resizable
structure patch is quite small now. Actually, after experimenting with
madvise(), memfd_create() and ftruncate(), I see that
MADV_POPULATE_WRITE is not required at all. We don't have to do
anything to expand a structure: memory is allocated as and when the
program writes to it. I also discovered a few things that I didn't
know about:

1. ftruncate() sets the size of the file but it doesn't allocate the
memory pages.
2. To use madvise(), the address needs to be backed by a file, so
memfd_create() is a must.
3. We can't write to file-backed memory at a location beyond the
size of the file. Hence we have to set the size of the file to the
maximum size at the beginning.
4. The address and length passed to madvise() need to be page aligned,
but those passed to fallocate() needn't be. `man fallocate` says
"Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
range starting at offset and continuing for len bytes. Within the
specified range, partial filesystem blocks are zeroed, and whole
filesystem blocks are removed from the file." It seems to take care
of the page size automatically, so using fallocate() simplifies the
logic. Further, `man madvise` says "but since Linux 3.5, any
filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
also supports MADV_REMOVE." So fallocate() with FALLOC_FL_PUNCH_HOLE
is guaranteed to be available on a system which supports MADV_REMOVE.
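
To make the four points above concrete, here is a minimal standalone
sketch of the reserve/grow/shrink cycle. It is not code from the
attached patches; the ResizableArea struct and the area_* function
names are made up for illustration, and it assumes Linux with glibc
exposing memfd_create() and fallocate().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor for one resizable shared area. */
typedef struct ResizableArea
{
    int     fd;         /* memfd backing the mapping */
    char   *base;       /* start of the mapping */
    size_t  max_size;   /* reserved address space, equal to the file size */
    size_t  cur_size;   /* bytes currently considered in use */
} ResizableArea;

/* Reserve max_size bytes; pages are allocated only when written to. */
static int
area_create(ResizableArea *a, size_t max_size, size_t initial_size)
{
    void   *p;

    assert(initial_size <= max_size);

    a->fd = memfd_create("resizable_area", 0);
    if (a->fd < 0)
        return -1;

    /* Sets the file size to the maximum up front; allocates no pages. */
    if (ftruncate(a->fd, (off_t) max_size) != 0)
    {
        close(a->fd);
        return -1;
    }

    p = mmap(NULL, max_size, PROT_READ | PROT_WRITE, MAP_SHARED, a->fd, 0);
    if (p == MAP_FAILED)
    {
        close(a->fd);
        return -1;
    }

    a->base = p;
    a->max_size = max_size;
    a->cur_size = initial_size;
    return 0;
}

/* Growing needs no system call: pages appear on first write. */
static int
area_grow(ResizableArea *a, size_t new_size)
{
    if (new_size < a->cur_size || new_size > a->max_size)
        return -1;
    a->cur_size = new_size;
    return 0;
}

/* Shrinking punches a hole over the tail; no page alignment needed. */
static int
area_shrink(ResizableArea *a, size_t new_size)
{
    if (new_size > a->cur_size)
        return -1;
    if (new_size < a->cur_size &&
        fallocate(a->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  (off_t) new_size, (off_t) (a->max_size - new_size)) != 0)
        return -1;
    a->cur_size = new_size;
    return 0;
}
```

A caller would run area_create() once with the maximum size, then
call area_grow() or area_shrink() as the structure is resized. The
mapping address never changes, which is why a single segment is
enough.
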

Using fallocate() (or madvise()) to free memory, we don't need
multiple segments, so there is much less code churn compared to the
multiple mappings approach. However, there is one drawback. In the
multiple mappings approach, an access beyond the current size of the
structure would result in a segfault or bus error. In the
fallocate/madvise approach such an access does not cause a crash. A
write beyond the pages that fit the current size of the structure
silently causes more memory to be allocated. A read returns 0s. So
there's a possibility that bugs in size calculations might go
unnoticed. I think that's how it works even today: an access to the
not-yet-allocated part of the shared memory simply goes unnoticed.
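
The drawback can be reproduced with a small standalone sketch. This is
only an illustration under the same assumptions (Linux, memfd_create()
and fallocate() available); the function name and its return
convention are hypothetical and do not come from the patches.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Access memory past the logical size of a structure that lives in a
 * larger reserved mapping.  Returns 0 if the stray accesses behaved as
 * described (read sees zero, write sticks, no signal raised), 1 if
 * unexpected data was observed, -1 on setup or system call failure.
 */
static int
stray_access_goes_unnoticed(size_t max_size, size_t logical_size)
{
    int     fd;
    int     result = 0;
    char   *base;
    void   *p;

    if (logical_size == 0 || logical_size >= max_size)
        return -1;

    fd = memfd_create("stray_access_demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t) max_size) != 0)
    {
        close(fd);
        return -1;
    }
    p = mmap(NULL, max_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
    {
        close(fd);
        return -1;
    }
    base = p;

    /* Legitimate use: stay within the first logical_size bytes. */
    base[0] = 1;

    /* Buggy read past the logical size: no crash, just a zero byte. */
    if (base[max_size - 1] != 0)
        result = 1;

    /* Buggy write past the logical size: a page is silently allocated. */
    base[max_size - 1] = 42;
    if (base[max_size - 1] != 42)
        result = 1;

    /* Punching out the tail frees that page; it reads as zero again. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  (off_t) logical_size,
                  (off_t) (max_size - logical_size)) != 0)
        result = -1;
    else if (base[max_size - 1] != 0)
        result = 1;

    munmap(base, max_size);
    close(fd);
    return result;
}
```

For example, stray_access_goes_unnoticed(1 << 20, 4096) is expected to
return 0 on a system where hole punching on a memfd works, showing
that neither the stray read nor the stray write is detected.
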

Please find attached the patches, with 0003 implementing resizable
structures using fallocate(). There are TODOs, and I also need to make
sure that resizable structures are disabled where memfd_create(),
fallocate() and anonymous memory mappings are not available. Also, the
test is unstable since it prints the memory consumption numbers
obtained from /proc/self/status. But it does demonstrate allocation
and freeing of shared memory as the shared structures are resized. I
don't think there is a stable way to use the numbers though, so we
might have to remove them ultimately.

--
Best Wishes,
Ashutosh Bapat
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-wip-Introduce-a-new-way-of-registering-shar-20260216.patch | text/x-patch | 53.8 KB |
| 0002-Get-rid-of-global-shared-memory-pointer-mac-20260216.patch | text/x-patch | 15.0 KB |
| 0003-WIP-resizable-shared-memory-structures-20260216.patch | text/x-patch | 45.5 KB |