Re: Adding basic NUMA awareness

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-10 14:17:21
Message-ID: aG/LcTxyVT1DtoB4@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Jul 09, 2025 at 03:42:26PM -0400, Andres Freund wrote:
> Hi,
>
> Thanks for working on this!

Indeed, thanks!

> On 2025-07-01 21:07:00 +0200, Tomas Vondra wrote:
> > 1) v1-0001-NUMA-interleaving-buffers.patch
> >
> > This is the main thing when people think about NUMA - making sure the
> > shared buffers are allocated evenly on all the nodes, not just on a
> > single node (which can happen easily with warmup). The regular memory
> > interleaving would address this, but it also has some disadvantages.
> >
> > Firstly, it's oblivious to the contents of the shared memory segment,
> > and we may not want to interleave everything. It's also oblivious to
> > alignment of the items (a buffer can easily end up "split" on multiple
> > NUMA nodes), or relationship between different parts (e.g. there's a
> > BufferBlock and a related BufferDescriptor, and those might again end up
> > on different nodes).
>
> Two more disadvantages:
>
> With OS interleaving, postgres doesn't (not easily, at least) know what
> maps to what, which means postgres can't do things like NUMA-aware buffer
> replacement.
>
> With OS interleaving the interleaving is "too fine grained", with pages being
> mapped at each page boundary, making it less likely for things like one
> strategy ring buffer to reside on a single NUMA node.

> > There's a secondary benefit of explicitly assigning buffers to nodes,
> > using this simple scheme - it allows quickly determining the node ID
> > given a buffer ID. This is helpful later, when building the freelist.

I do think this is a big advantage compared to the OS interleaving.
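
For illustration, a minimal sketch of the kind of lookup that scheme enables
(the helper name, the numa_nodes argument and the equal contiguous split are
my assumptions, not something taken from the patch):

    /*
     * Hypothetical helper: if buffers are split into equal contiguous
     * chunks, one chunk per NUMA node, then the node ID is a simple
     * division away (Buffer IDs are 1-based).
     */
    static inline int
    BufferGetNode(Buffer buffer, int numa_nodes)
    {
        int     buffers_per_node = NBuffers / numa_nodes;

        return (buffer - 1) / buffers_per_node;
    }

No per-buffer bookkeeping or hash lookup is needed, which is what makes
things like building a per-node freelist cheap.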

> I wonder if we should *increase* the size of shared_buffers whenever huge
> pages are in use and there's padding space due to the huge page
> boundaries. Pretty pointless to waste that memory if we can instead use it for
> the buffer pool. Not that big a deal with 2MB huge pages, but with 1GB huge
> pages...

I think that makes sense, except maybe for operations that need to scan the
whole buffer pool (i.e. those related to BUF_DROP_FULL_SCAN_THRESHOLD)?
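
To put a rough number on it: with 1GB huge pages and 8kB blocks, the padding
before the next huge page boundary can hold up to 1GB/8kB = 131072 buffers
per node, while with 2MB huge pages it is at most a couple hundred. A
back-of-the-envelope sketch (chunk_size and hugepagesize are placeholders of
mine, not names from the patch):

    /*
     * Hypothetical sizing: how many extra buffers fit in the padding
     * between the end of a per-node chunk and the next huge page
     * boundary.  BLCKSZ is typically 8192.
     */
    Size        remainder = chunk_size % hugepagesize;
    Size        padding = (remainder == 0) ? 0 : hugepagesize - remainder;
    int         extra_buffers = (int) (padding / BLCKSZ);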

> > 5) v1-0005-NUMA-interleave-PGPROC-entries.patch
> >
> > Another area that seems like it might benefit from NUMA is PGPROC, so I
> > gave it a try. It turned out somewhat challenging. Similarly to buffers
> > we have two pieces that need to be located in a coordinated way - PGPROC
> > entries and fast-path arrays. But we can't use the same approach as for
> > buffers/descriptors, because
> >
> > (a) Neither of those pieces aligns with memory page size (PGPROC is
> > ~900B, fast-path arrays are variable length).
>
> > (b) We could pad PGPROC entries e.g. to 1KB, but that'd still require
> > rather high max_connections before we use multiple huge pages.
>
> Right now sizeof(PGPROC) happens to be a multiple of 64 (i.e. the most common
> cache line size)

Oh right, it's currently 832 bytes and the patch extends that to 840 bytes.

With a bit of reordering:

diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5cb1632718e..2ed2f94202a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -194,8 +194,6 @@ struct PGPROC
                                  * vacuum must not remove tuples deleted by
                                  * xid >= xmin ! */
 
-    int         procnumber;     /* index in ProcGlobal->allProcs */
-
     int         pid;            /* Backend's process ID; 0 if prepared xact */
 
     int         pgxactoff;      /* offset into various ProcGlobal->arrays with
@@ -243,6 +241,7 @@ struct PGPROC
 
     /* Support for condition variables. */
     proclist_node cvWaitLink;   /* position in CV wait list */
+    int         procnumber;     /* index in ProcGlobal->allProcs */
 
     /* Info about lock the process is currently waiting for, if any. */
     /* waitLock and waitProcLock are NULL if not currently waiting. */
@@ -268,6 +267,7 @@ struct PGPROC
      */
     XLogRecPtr  waitLSN;        /* waiting for this LSN or higher */
     int         syncRepState;   /* wait state for sync rep */
+    int         numa_node;
     dlist_node  syncRepLinks;   /* list link if process is in syncrep queue */
 
     /*
@@ -321,9 +321,6 @@ struct PGPROC
     PGPROC     *lockGroupLeader;    /* lock group leader, if I'm a member */
     dlist_head  lockGroupMembers;   /* list of members, if I'm a leader */
     dlist_node  lockGroupLink;  /* my member link, if I'm a member */
-
-    /* NUMA node */
-    int         numa_node;
 };

That would bring it back to 832 bytes (though the resulting field order no
longer makes much sense logically).
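
If we want to make sure that property does not silently regress again, a
compile-time check next to the struct definition could help. A sketch (not
part of the patch; I'm spelling the common 64-byte line explicitly, since
PG_CACHE_LINE_SIZE is 128):

    StaticAssertDecl(sizeof(PGPROC) % 64 == 0,
                     "sizeof(PGPROC) should be a multiple of the cache line size");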

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
