Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-09-11 15:41:23
Message-ID: 1d57d68d-b178-415a-ba11-be0c3714638e@vondra.me
Lists: pgsql-hackers

On 9/11/25 10:32, Tomas Vondra wrote:
> ...
>
> For example, we may get confused about the memory page size. The "size"
> happens before allocation, and at that point we don't know whether we'll
> succeed in getting enough huge pages. When "init" happens, we already
> know that, so our "memory page size" could be different. We must be
> careful, e.g. not to need more memory than we requested.
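
To show what I mean, a rough sketch (not the actual code, just made-up
names and a hard-coded 2MB huge page size): the "size" phase rounds up
assuming huge pages, but the "init" phase may fall back to regular pages,
so init can see a different page size than the sizing assumed.

#include <stddef.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE      (2 * 1024 * 1024)   /* assumed at "size" time */
#define REGULAR_PAGE_SIZE   4096

static size_t
shmem_size(size_t requested)
{
    /* round up to the page size we hope to get */
    return ((requested + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE) * HUGE_PAGE_SIZE;
}

static void *
shmem_init(size_t size, size_t *effective_page_size)
{
    /* try huge pages first ... */
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (ptr != MAP_FAILED)
    {
        *effective_page_size = HUGE_PAGE_SIZE;
        return ptr;
    }

    /* ... then fall back to regular pages - the page size just changed */
    *effective_page_size = REGULAR_PAGE_SIZE;
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}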

I forgot to mention the other issue with huge pages on NUMA. I already
reported [1] that it's trivial to crash with a SIGBUS, because

(1) huge pages get reserved on all NUMA nodes (evenly)

(2) the decision whether to use huge pages is done by mmap(), which only
needs to check if there are enough huge pages in total

(3) numa_tonode_memory is called later, and does not verify whether the
target node has enough free huge pages (I'm not sure it should / can)

(4) we only partition (and bind to NUMA nodes) some of the memory, and
the rest (which is much smaller, but still sizeable) likely causes an
"imbalance" - it gets placed on one (random) node, which then does not
have enough free huge pages left for the stuff we explicitly placed there

(5) then at some point we try accessing one of the shared buffers, which
triggers a page fault, tries to get a huge page on that NUMA node, finds
there are no free huge pages, and crashes with SIGBUS (sketched below)
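
Something like this minimal sketch (assumes libnuma, build with -lnuma;
the 1GB size and missing error handling are just to keep it short).
Neither mmap() nor numa_tonode_memory() looks at the per-node free huge
pages, so the shortage only shows up at fault time:

#include <numa.h>
#include <string.h>
#include <sys/mman.h>

#define SEGMENT_SIZE    (1024L * 1024 * 1024)

int
main(void)
{
    /* (2) succeeds as long as the *total* pool has enough huge pages */
    char *ptr = mmap(NULL, SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    /* (3) only sets the policy, doesn't check free huge pages on node 0 */
    numa_tonode_memory(ptr, SEGMENT_SIZE, 0);

    /*
     * (5) first touch faults the pages in on node 0; if node 0 has run
     * out of huge pages (e.g. because the unpartitioned memory landed
     * there), the kernel sends SIGBUS right here
     */
    memset(ptr, 0, SEGMENT_SIZE);

    return 0;
}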

It clearly is not an option to just let it crash, but I still don't have
a great idea how to address it. The only idea I have is to manually
interleave the whole shared memory (when using huge pages), page by
page, so that this imbalance does not happen.
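
Roughly, I imagine the manual interleaving would look something like this
(just a sketch with made-up names, assuming libnuma and 2MB huge pages):

#include <numa.h>
#include <stddef.h>

#define HUGE_PAGE_SIZE  (2 * 1024 * 1024)

/*
 * Bind the remaining shared memory to the nodes one huge page at a
 * time, round-robin, so no single node has to absorb all of it.
 */
static void
interleave_range(char *start, size_t size, int num_nodes)
{
    size_t npages = size / HUGE_PAGE_SIZE;

    for (size_t i = 0; i < npages; i++)
    {
        /* place huge page i on node (i mod num_nodes) */
        numa_tonode_memory(start + i * HUGE_PAGE_SIZE,
                           HUGE_PAGE_SIZE,
                           (int) (i % num_nodes));
    }
}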

But it's harder than it looks, because we don't necessarily partition
everything evenly. For example, one node can get a smaller chunk of
shared buffers, because we try to partition buffers and buffer
descriptors in a "nice" way. The PGPROC stuff is also not distributed
quite evenly (e.g. aux/2pc entries are not mapped to any node).

A different approach would be to calculate how many per-node huge pages
we'll need for the stuff we partition explicitly (buffers and PGPROC),
plus how many pages the rest of the memory might need (it can get placed
on any node), and then reserve that "maximum" on every node. But that's
annoyingly wasteful, because every node except the one that actually
gets the rest will end up with unusable memory.
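
A back-of-the-envelope example of how that waste adds up (the numbers
are made up, just for illustration):

#include <stdio.h>

int
main(void)
{
    int  num_nodes = 4;
    long explicit_per_node = 4096;  /* buffers + PGPROC slice, per node */
    long unpartitioned = 512;       /* could land on any one node */

    /* each node must hold its slice plus (worst case) all the rest */
    long per_node = explicit_per_node + unpartitioned;
    long wasted = unpartitioned * (num_nodes - 1);

    printf("reserve %ld huge pages per node (%ld total), %ld wasted\n",
           per_node, per_node * num_nodes, wasted);

    return 0;
}

With 2MB pages that's 1536 huge pages (~3GB) reserved for nothing, and
it only gets worse with more nodes.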

regards

[1]
https://www.postgresql.org/message-id/71a46484-053c-4b81-ba32-ddac050a8b5d%40vondra.me

--
Tomas Vondra
