| From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, chaturvedipalak1911(at)gmail(dot)com |
| Subject: | Re: Better shared data structure management and resizable shared data structures |
| Date: | 2026-02-23 14:14:23 |
| Message-ID: | CAExHW5so6VSxBC-1V=35229Z1+dw5vhw8HxHg9ry7UzceKcXzA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > > 4. the address and length passed to madvise needs to be page aligned,
> > > but that passed to fallocate() needn't be. `man fallocate` says
> > > "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> > > 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> > > range starting at offset and continuing for len bytes. Within the
> > > specified range, partial filesystem blocks are zeroed, and whole
> > > filesystem blocks are removed from the file.". It seems to be
> > > automatically taking care of the page size. So using fallocate()
> > > simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> > > filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> > > also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> > > guaranteed to be available on a system which supports MADV_REMOVE.
> >
> > I think it makes no sense to support resizing below page size
> > granularity. What's the point of doing that?
> >
>
> No point really. But we cannot control the extensions, which may want
> to specify a maximum size smaller than a page. They wouldn't know
> what page size the underlying machine will have, especially with huge
> pages, which come in a wide range of sizes. Even in the case of shared
> buffers, a value of max_shared_buffers may cause buffer blocks to span
> pages while other structures fit within a page.
>
> In the attached patches, if a resizable structure's max_size is
> smaller than the page size, it is treated as a fixed structure with
> size = max_size. Any request to resize such a structure simply
> updates the metadata without an actual madvise() operation. Only
> structures whose max_size > page_size are treated as truly resizable
> and use madvise(). You bring up another interesting point. If a
> resizable structure has a maximum size larger than the page size, but
> it is allocated such that its initial part is on a partially
> allocated page and its last part is on another partially allocated
> page, those pages are never freed because of the adjoining
> structures. Per the logic in the attached patches, all the fixed (or
> pseudo-resizable) structures are packed together. The resizable
> structures start on a page boundary and their max_sizes are adjusted
> to be page-aligned. That way we can release pages when the structure
> shrinks by more than a page.
It was a mistake on my part to assume that more memory would be freed
if we page-aligned the start and end of a resizable structure. I
didn't account for the memory wasted in the alignment itself, which
comes out to be the same as the amount wasted if we don't page-align
the structure. The code is simpler without page alignment, as seen in
the attached patches.
> > >
> > > > Using fallocate() (or madvise()) to free memory, we don't need
> > > > multiple segments. So much less code churn compared to the multiple
> > > > mappings approach. However, there is one drawback. In the multiple
> > > > mapping approach access beyond the current size of the structure would
> > > > result in segfault or bus error. But in the fallocate/madvise approach
> > > > such an access does not cause a crash. A write beyond the pages that
> > > > fit the current size of the structure causes more memory to be
> > > > allocated silently. A read returns 0s. So, there's a possibility that
> > > > bugs in size calculations might go unnoticed. I think that's how it
> > > > works even today, access in the yet un-allocated part of the shared
> > > > memory will simply go unnoticed.
> > >
> > > If that's something you care about, you can mprotect(PROT_NONE) the relevant
> > > regions.
> >
> > I am fine, if we let go of this protection while getting rid of
> > multiple segments, if we all agree to do so.
> >
> > I could be wrong, but mprotect needs to be executed in every backend
> > where the memory is mapped and then a new backend needs to inherit it
> > from the postmaster. Makes resizing complex since it has to touch
> > every backend. So avoiding mprotect is better.
I discussed this point with Andres off-list. Here's a summary of that
discussion. Any serious user of resizable shared memory structures
would need to send a ProcSignalBarrier to synchronize the resizing
across backends. The same barrier can be used to perform mprotect() in
each backend, with a separate signal to the postmaster if mprotect()
is needed there too. But whether mprotect() is needed at all depends
upon the use case, so it should be the responsibility of the user of
the resizable structure and not of ShmemResizeRegistered().
The following points need a bit of discussion.
1. Calculation of allocated_size
For fixed-size shared memory structures, allocated_size is the size
of the structure after cache-aligning it. Assuming that shared memory
is allocated in pages, this is also the memory actually allocated to
the structure once the whole structure has been written to. For a
resizable structure, it's a bit more complicated. We reserve address
space for the maximum size of the structure. At a given point in
time, the page where the next structure begins and the page
containing the current end of the structure are allocated; the pages
in between are not. Thus allocated_size should be the length from the
start of the structure to the end of the page containing its current
end, plus the part of the page where the next structure starts, up to
the start of the next structure. That is what is implemented in the
attached patches.
2. GUCs shared_memory_size, shared_memory_size_in_huge_pages
These GUCs report the size of the shared memory in bytes and in huge
pages respectively. Without resizable shared memory structures,
calculating them is straightforward: we sum the sizes of all the
requested structures. With resizable structures, these GUCs do not
make much sense. Since the memory allocated to a resizable structure
can be anywhere between zero and its maximum, neither the sum of the
initial sizes nor the sum of the maximum sizes can be reported as
shared_memory_size; similarly for shared_memory_size_in_huge_pages.
We need two GUCs to replace each of the existing ones:
max_shared_memory_size and initial_shared_memory_size, plus their
huge-page peers. max_shared_memory_size is the sum of the maximum
sizes of the resizable structures plus the requested sizes of the
fixed structures. initial_shared_memory_size is the sum of the
initial sizes requested for all the structures.
3. Testing the memory allocation
I couldn't find a way to reliably determine the shared memory
allocated at a given address in a process. RssShmem gives the amount
of shared memory accessed by the process, which includes the memory
allocated to the fixed structures the process has touched, and its
value isn't stable across runs of the test in the patch. The test
logs the reported RssShmem against the variations in the size of the
resizable shared memory structure, which can be visually inspected to
be within limits, but those limits are hard to assert in the test
code. Looking for some suggestions here.
Disabling resizable structures on builds that do not support them is
still a TODO.
--
Best Wishes,
Ashutosh Bapat
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-wip-Introduce-a-new-way-of-registering-shar-20260223.patch | text/x-patch | 53.8 KB |
| 0002-Get-rid-of-global-shared-memory-pointer-mac-20260223.patch | text/x-patch | 15.0 KB |
| 0003-WIP-resizable-shared-memory-structures-20260223.patch | text/x-patch | 38.9 KB |