From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-09-11 15:41:23
Message-ID: 1d57d68d-b178-415a-ba11-be0c3714638e@vondra.me
Lists: pgsql-hackers
On 9/11/25 10:32, Tomas Vondra wrote:
> ...
>
> For example, we may get confused about the memory page size. The "size"
> step happens before allocation, and at that point we don't know whether
> we'll succeed in getting enough huge pages. When "init" happens, we
> already know that, so our "memory page size" could be different. We must
> be careful, e.g. not to end up needing more memory than we requested.
I forgot to mention the other issue with huge pages on NUMA. I already
reported [1] that it's trivial to crash with a SIGBUS, because:
(1) huge pages get reserved on all NUMA nodes (evenly)
(2) the decision whether to use huge pages is made by mmap(), which only
needs to check whether there are enough free huge pages in total
(3) numa_tonode_memory() is called later, and does not verify whether the
target node has enough free pages (I'm not sure it should / can)
(4) we only partition (and bind to NUMA nodes) some of the memory, and
the rest (which is much smaller, but still sizeable) is likely what causes
the "imbalance" - it gets placed on one (random) node, and that node then
does not have enough space for the stuff we explicitly placed there
(5) then at some point we try accessing one of the shared buffers, which
triggers a page fault, tries to get a huge page on the NUMA node, finds
there are no free huge pages, and crashes with SIGBUS (sketched below)
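Roughly like this stand-alone sketch (not the actual reproducer from [1];
the size and the target node are made up, assuming 2MB huge pages and
libnuma):

/*
 * mmap() with MAP_HUGETLB succeeds as long as there are enough free huge
 * pages in total, numa_tonode_memory() only sets the policy, and the
 * failure only surfaces when the page fault can't find a free huge page
 * on the target node.
 *
 * Build with: gcc -o sigbus-sketch sigbus-sketch.c -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numa.h>

#define HUGE_PAGE_SIZE	(2 * 1024 * 1024)

int
main(void)
{
	size_t		npages = 1024;	/* more than a single node has reserved */
	size_t		sz = npages * HUGE_PAGE_SIZE;
	char	   *ptr;

	if (numa_available() < 0)
		return 1;

	/* (2) only checks the total number of free huge pages */
	ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (ptr == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* (3) only sets the policy, does not check free pages on node 0 */
	numa_tonode_memory(ptr, sz, 0);

	/*
	 * (5) faulting the pages in is what actually allocates them - if node 0
	 * does not have npages free huge pages, this dies with SIGBUS
	 */
	memset(ptr, 0, sz);

	printf("survived\n");
	return 0;
}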
It clearly is not an option to just let it crash, but I still don't have
a great idea for how to address it. The only idea I have is to manually
interleave the whole shared memory segment (when using huge pages), page
by page, so that this imbalance does not happen.
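To illustrate what I mean, a rough sketch (the function name and the
parameters are made up, and it ignores the chunks we already bind
explicitly):

#include <stddef.h>
#include <numa.h>

static void
interleave_shmem_pages(char *base, size_t total_size, size_t huge_page_size)
{
	int			nnodes = numa_num_configured_nodes();
	int			node = 0;
	size_t		off;

	for (off = 0; off < total_size; off += huge_page_size)
	{
		size_t		chunk = huge_page_size;

		if (off + chunk > total_size)
			chunk = total_size - off;

		/* only sets the policy; each page is allocated on first fault */
		numa_tonode_memory(base + off, chunk, node);
		node = (node + 1) % nnodes;
	}
}

numa_interleave_memory() would do roughly the same thing for the whole
range, but doing it page by page would let us skip the ranges we already
placed explicitly.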
But it's harder than it looks, because we don't necessarily partition
everything evenly. For example, one node can get a smaller chunk of
shared buffers, because we try to partition buffers and buffer
descriptors in a "nice" way. The PGPROC stuff is also not distributed
quite evenly (e.g. aux/2pc entries are not mapped to any node).
A different approach would be to calculate how many per-node huge pages
we'll need for the stuff we partition explicitly (buffers and PGPROC),
plus the rest of the memory, which can get placed on any node, and then
require on every node the "maximum" number of pages that could end up on
any single node. But that's annoyingly wasteful, because every other node
will end up with reserved huge pages it can never use.
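Something like this back-of-the-envelope calculation (all the names are
made up; part_size[i] is the memory we bind explicitly to node i,
rest_size is everything else, which may land on any single node):

#include <stddef.h>

static size_t
huge_pages_needed_per_node(const size_t *part_size, int nnodes,
						   size_t rest_size, size_t huge_page_size)
{
	size_t		max_pages = 0;
	int			i;

	for (i = 0; i < nnodes; i++)
	{
		/* worst case: the whole "rest" lands on this node too */
		size_t		bytes = part_size[i] + rest_size;
		size_t		pages = (bytes + huge_page_size - 1) / huge_page_size;

		if (pages > max_pages)
			max_pages = pages;
	}

	/*
	 * We'd have to reserve max_pages on every node, so on all but one node
	 * roughly rest_size worth of huge pages can never be used.
	 */
	return max_pages;
}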
regards
[1]
https://www.postgresql.org/message-id/71a46484-053c-4b81-ba32-ddac050a8b5d%40vondra.me
--
Tomas Vondra