Re: NUMA shared memory interleaving

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: NUMA shared memory interleaving
Date: 2025-06-27 16:41:30
Message-ID: 6686e42d-6e2d-4c39-8fc5-d6574604fb8e@vondra.me
Lists: pgsql-hackers

Hi,

I agree we should improve the behavior on NUMA systems. But I'm not sure
this patch goes far enough, and the approach seems a bit too blunt,
ignoring some interesting details.

AFAICS the patch essentially does the same thing as

numactl --interleave=all

except that it only does that to shared memory, not to process private
memory (as if we called numa_set_localalloc). Which means it has some of
the problems people observe with --interleave=all.

In particular, this practically guarantees that (with 4K memory pages)
each buffer gets split across two NUMA nodes, because the first half of
the buffer goes to node N while the second half goes to node (N+1).
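
To make this concrete, here's a tiny stand-alone illustration (not code
from the patch) of the page-granular interleaving arithmetic, assuming
4 nodes and round-robin placement of 4kB pages:

#include <stdio.h>

int main(void)
{
    const size_t page_size = 4096;  /* 4kB memory page */
    const size_t blcksz = 8192;     /* BLCKSZ, one shared buffer */
    const int    nnodes = 4;        /* assumed number of NUMA nodes */

    /* with interleaving, page k of the segment goes to node k % nnodes */
    for (int buf = 0; buf < 4; buf++)
    {
        size_t first_page = (buf * blcksz) / page_size;
        size_t last_page = (buf * blcksz + blcksz - 1) / page_size;

        printf("buffer %d: halves on nodes %zu and %zu\n",
               buf, first_page % nnodes, last_page % nnodes);
    }
    return 0;
}

Every buffer prints two different nodes - no buffer ends up local to a
single node.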

That doesn't seem great. It's likely better than an imbalanced system
with everything allocated on a single NUMA node, but I don't see how it
could beat a balanced, properly warmed-up system where the buffers are
not split like this.

But OK, admittedly this only happens for 4K memory pages, and serious
systems with a lot of memory are likely to use huge pages, which makes
this less of an issue (only the buffers crossing the page boundaries
might get split).
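
For scale: with 2MB huge pages, one page holds 2MB / 8kB = 256 buffers,
so buffers land on nodes in groups of 256, and a buffer can only get
split if the buffer array is not 8kB-aligned relative to the huge page
boundary (in which case at most one buffer per huge page straddles it).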

My bigger comment however is that the approach focuses on balancing the
nodes (i.e. ensuring each node gets a fair share of shared memory), and
is entirely oblivious to the internal structure of the shared memory.

* It interleaves the shared segment, but the segment has many pieces -
shared buffers are the largest, but not the only one. Does it make sense
to interleave all the other pieces too?

* Some of the pieces are tightly related. For example, we talk about
shared buffers as if it was one big array, but it's actually two arrays
- blocks and descriptors. Even if buffers don't get split between nodes
(thanks to huge pages), there's no guarantee the descriptor for a buffer
does not end up on a different node than the buffer itself.

* In fact, the descriptors are so much smaller than blocks that it's
practically guaranteed all descriptors will end up on a single node
(see the back-of-the-envelope below).
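
A quick back-of-the-envelope, assuming the usual BufferDesc padded to
64 bytes:

    sizeof(BufferDesc)  =  64 B
    BLCKSZ              =  8192 B  =>  descriptors are 128x smaller
    one 2MB huge page   =  32768 descriptors  =>  256MB of buffers

So the descriptors for a quarter gigabyte worth of buffers share a
single huge page, which lands on exactly one node - chosen with no
regard to where the corresponding blocks went.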

I could probably come up with a couple more similar items, but I think
you get the idea. I do think making Postgres NUMA-aware will require
figuring out how to distribute (or not distribute) different parts of
the shared memory, and doing that explicitly. And doing it in a way that
allows us to do other stuff in a NUMA-aware way too, e.g. have a
separate freelist and clocksweep for each NUMA node, etc.
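
To illustrate what I mean by "explicitly" - a rough sketch only, the
function and layout are hypothetical, and alignment/remainder handling
is omitted - one could slice the related arrays consistently, so that
buffer i and its descriptor land on the same node, and a per-node
freelist/clocksweep can later work on the n-th slice:

#include <numa.h>

static void
place_buffer_arrays(char *blocks, size_t blocks_size,
                    char *descs, size_t descs_size, int nnodes)
{
    size_t  bchunk = blocks_size / nnodes;
    size_t  dchunk = descs_size / nnodes;

    for (int n = 0; n < nnodes; n++)
    {
        /* blocks for slice n go to node n ... */
        numa_tonode_memory(blocks + n * bchunk, bchunk, n);
        /* ... and so do their descriptors */
        numa_tonode_memory(descs + n * dchunk, dchunk, n);
    }
}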

That's something numa_interleave_memory simply can't do for us, and I
suppose it might also have other downsides on large instances. I mean,
doesn't it have to create a separate mapping for each memory page?
Wouldn't that be a bit inefficient/costly for big instances?

Of course, I'm not saying all this as a random passerby - I've been
working on a similar patch for a while, based on Andres' experimental
NUMA branch. It's far from complete/perfect, more of a PoC quality, but
I hope to share it on the mailing list sometime soon.

FWIW while I think the patch doesn't go far enough, there's one area
where I think it probably goes way too far - configurability. I agree
it's reasonable to allow running on a subset of nodes, e.g. to split the
system between multiple instances etc. But do we need to configure that
from Postgres? Aren't people likely to be using something like
containers or k8s anyway? I think we should just try to inherit this
from the environment, i.e. determine which nodes we're allowed to run
on, and use that. Maybe we'll find we need to be smarter, but I think we
can leave that for later.
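
FWIW libnuma already exposes the inherited limits, so "use the
environment" could be as simple as this (illustrative only, build with
-lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 0;               /* no NUMA support, nothing to do */

    /* nodes this process may allocate from, as constrained by the
     * cpusets/cgroups set up by the container runtime */
    struct bitmask *allowed = numa_get_mems_allowed();

    for (int n = 0; n <= numa_max_node(); n++)
        if (numa_bitmask_isbitset(allowed, n))
            printf("may allocate on node %d\n", n);

    numa_free_nodemask(allowed);
    return 0;
}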

regards

--
Tomas Vondra
