From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: NUMA shared memory interleaving
Date: 2025-06-30 19:23:43
Message-ID: 236c633b-99d3-4788-b87d-7bd50fcacf50@vondra.me
Lists: pgsql-hackers
On 6/30/25 12:55, Jakub Wartak wrote:
> Hi Tomas!
>
> On Fri, Jun 27, 2025 at 6:41 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>> I agree we should improve the behavior on NUMA systems. But I'm not sure
>> this patch goes far enough, or perhaps the approach seems a bit too
>> blunt, ignoring some interesting stuff.
>>
>> AFAICS the patch essentially does the same thing as
>>
>> numactl --interleave=all
>>
>> except that it only does that to shared memory, not to process private
>> memory (as if we called numa_set_localalloc). Which means it has some of
>> the problems people observe with --interleave=all.
>>
>> In particular, this practically guarantees that (with 4K memory pages)
>> each buffer hits multiple NUMA nodes, because the first half will go
>> to node N, while the second half goes to node (N+1).
>>
>> That doesn't seem great. It's likely better than a misbalanced system
>> with everything allocated on a single NUMA node, but I don't see how it
>> could be better than a "balanced", properly warmed-up system where the
>> buffers are not split like this.
>>
>> But OK, admittedly this only happens for 4K memory pages, and serious
>> systems with a lot of memory are likely to use huge pages, which makes
>> this less of an issue (only the buffers crossing the page boundaries
>> might get split).
>>
>>
>> My bigger comment however is that the approach focuses on balancing the
>> nodes (i.e. ensuring each node gets a fair share of shared memory), and
>> is entirely oblivious to the internal structure of the shared memory.
>>
>> * It interleaves the shared segment, but it has many pieces - shared
>> buffers are the largest but not the only one. Does it make sense to
>> interleave all the other pieces?
>>
>> * Some of the pieces are tightly related. For example, we talk about
>> shared buffers as if it was one big array, but it actually is two arrays
>> - blocks and descriptors. Even if buffers don't get split between nodes
>> (thanks to huge pages), there's no guarantee the descriptor for the
>> buffer does not end up on a different node.
>>
>> * In fact, the descriptors are so much smaller than blocks that it's
>> practically guaranteed all descriptors will end up on a single node.
>>
>>
>> I could probably come up with a couple more similar items, but I think
>> you get the idea. I do think making Postgres NUMA-aware will require
>> figuring out how to distribute (or not distribute) different parts of
>> the shared memory, and do that explicitly. And do that in a way that
>> allows us to do other stuff in a NUMA-aware way, e.g. have separate
>> freelists and clocksweep for each NUMA node, etc.
>
> I do understand what you mean, but I'm *NOT* stating here that it
> makes PG fully "NUMA-aware"; I actually try to avoid claiming that in
> every sentence. This is only about the imbalance problem specifically.
> I think we could build those follow-up optimizations as separate
> patches in this or follow-up threads. If we did it all in one
> giant 0001 (without a split), the very first question would be to
> quantify the impact of each of those optimizations (for which we would
> probably need more GUCs?). Here I'm just showing that the very first
> baby step - interleaving - helps avoid interconnect saturation in some
> cases too.
>
> Anyway, even putting aside the fact that local mallocs() would be
> interleaved, adjusting systemd startup scripts to just include
> `numactl --interleave=all` sounds like a dirty hack, not proper UX.
>
I wasn't suggesting we do "numactl --interleave=all". My argument was
simply that doing numa_interleave_memory() has most of the same issues,
because it's oblivious to what's stored in the shared memory. Sure, the
fact that local memory is not interleaved too is an improvement.
But I just don't see how this could be 0001, followed by some later
improvements. ISTM the improvements would have to largely undo 0001
first, and it would be nontrivial if an optimization needs to do that
only for some part of the shared memory.
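To make that concrete, here is a rough sketch (not from either patch,
the names and layout are made up) of the kind of explicit,
structure-aware placement I have in mind, using plain libnuma calls. It
assumes the chunk boundaries are (huge) page aligned and ignores error
handling:

#include <numa.h>

/*
 * Sketch only: bind matching per-node chunks of the buffer blocks and
 * their descriptors, so that a buffer and its descriptor end up on the
 * same NUMA node. A blanket numa_interleave_memory() over the whole
 * segment can't give us that, because it doesn't know where each array
 * starts or how it's indexed.
 */
static void
place_buffers_sketch(char *blocks, size_t blocks_size,
                     char *descs, size_t descs_size)
{
    int     nnodes;

    if (numa_available() < 0)
        return;                 /* no NUMA support, do nothing */

    nnodes = numa_num_configured_nodes();

    for (int node = 0; node < nnodes; node++)
    {
        numa_tonode_memory(blocks + node * (blocks_size / nnodes),
                           blocks_size / nnodes, node);
        numa_tonode_memory(descs + node * (descs_size / nnodes),
                           descs_size / nnodes, node);
    }
}
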
> Also please note that:
> * I do not have a lot of time to dedicate to it, yet I was kind of
> always interested in researching that and wondering why we couldn't do
> it for such a long time, hence the previous observability work and now
> $subject (note it is not claiming to be full blown NUMA awareness,
> just some basic NUMA interleave as first [well, second?] step).
Sorry, I appreciate the time you spent working on these features. It
wasn't my intention to dunk on your patch. I'm afraid this is an example
of how reactions on -hackers are often focused on pointing out issues. I
apologize for that, I should have realized it earlier.
I certainly agree it'd be good to improve the NUMA support, otherwise I
wouldn't be messing with Andres' PoC patches myself.
> * I've raised this question in the first post "How to name this GUC
> (numa or numa_shm_interleave)?" I still have no idea, but `numa`
> simply looks better, and we could just add way more stuff to it over
> time (in PG19 or future versions?). Does that sound good?
>
I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
different parts of the shared memory. But that's mostly for development,
to allow easy experimentation. I don't have a clear idea what the UX
should look like.
>> That's something numa_interleave_memory simply can't do for us, and I
>> suppose it might also have other downsides on large instances. I mean,
>> doesn't it have to create a separate mapping for each memory page?
>> Wouldn't that be a bit inefficient/costly for big instances?
>
> No? Or what kind of mapping do you have in mind? I think our shared
> memory on the kernel side is just a single VMA (contiguous memory
> region), on which technically we execute mbind() (libnuma is just a
> wrapper around it). I have not observed any kind of regressions;
> actually, quite the opposite. I'm also not sure what you mean by 'big
> instances' (AFAIK 1-2TB shared_buffers might even fail to start).
>
Something as simple as giving a contiguous chunk of memory to each NUMA
node. Essentially 1/N of the memory goes to the first NUMA node, and so
on. I haven't
looked into the details of how NUMA interleaving works, but from the
discussions I had about it, I understood it might be expensive. Not
sure, maybe that's wrong.
But the other reason for a simpler mapping is that it seems useful to be
able to easily calculate which NUMA node a buffer belongs to, because
then you can do NUMA-aware freelists, clocksweep, etc.
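For illustration, something as trivial as this is the kind of mapping I
mean (the names are hypothetical, not from any posted patch), assuming
shared buffers are carved into equal contiguous chunks, one per node:

/*
 * Sketch only: map a buffer id to the NUMA node owning it, assuming
 * shared buffers are split into equal contiguous chunks, one per node,
 * with any remainder assigned to the last node. Assumes
 * nbuffers >= nnodes.
 */
static inline int
BufferIdGetNode(int buf_id, int nbuffers, int nnodes)
{
    int     chunk = nbuffers / nnodes;      /* buffers per node */
    int     node = buf_id / chunk;

    return (node < nnodes) ? node : nnodes - 1;
}

With a mapping like that, per-node freelists or clocksweep hands can be
indexed directly by the node number, without having to query where the
kernel actually placed each page.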
>> Of course, I'm not saying all this as a random passerby - I've been
>> working on a similar patch for a while, based on Andres' experimental
>> NUMA branch. It's far from complete/perfect, more of a PoC quality, but
>> I hope to share it on the mailing list sometime soon.
>
> Cool, I didn't know Andres's branch was public until now. I know he
> referenced multiple issues in his presentation (and hackathon!), but I
> wanted to divide the work and try to get something in at least
> partially, step by step, to have at least something. I think we should
> collaborate (not a lot of people are interested in this?) and I can
> try to offer my limited help if you attack the more advanced problems.
> I think we could get more juice by properly over(allocating)/spreading/
> padding certain special regions (e.g. better distributing ProcArray,
> but what about cache hits?), or do you want to start from scratch and
> re-design/re-think all shm allocations case by case?
>
+1 to collaboration, absolutely. I was actually planning to ping you
once I have something workable. I hope I'll be able to polish the WIP
patches a little bit and post them sometime this week.
>> FWIW while I think the patch doesn't go far enough, there's one area
>> where I think it probably goes way too far - configurability. I agree
>> it's reasonable to allow running on a subset of nodes, e.g. to split the
>> system between multiple instances etc. But do we need to configure that
>> from Postgres? Aren't people likely to already use something like
>> containers or k8 anyway?
>> I think we should just try to inherit this from
>> the environment, i.e. determine which nodes we're allowed to run on,
>> and use that. Maybe we'll find we need to be smarter, but I think we
>> can leave that for later.
>
> That's what "numa=all" is all about (take whatever is there in the
> OS/namespace), but I do not know a better way than, let's say,
> numa_get_mems_allowed() being altered somehow by namespace/cgroups. I
> think if one runs on k8/containers then it's a quite limited/small
> deployment and wouldn't benefit from this at all (I struggle to
> imagine the point of a k8 pod using 2+ sockets); quite the contrary,
> my experience indicates that the biggest deployments are usually
> almost bare metal? And it's way easier to get consistent results
> there. Anyway, as you say, let's leave it for later. PG currently is
> often not CPU-aware (i.e. it does not even adjust the sizing of
> certain structs based on CPU count), so making it NUMA-aware or
> cgroup/namespace-aware already sounds like taking 2-3 steps ahead
> into the future [I think we had at least one discussion, in LWLock
> partition manager / FP_LOCK_SLOTS_PER_BACKEND, where I proposed to
> size certain structures based on $VCPUs, or I am misremembering this]
>
+1 to leaving this for later; we can worry about it once we have things
working with basic whole-system NUMA setups. I hope people running such
deployments will give us feedback on what config they actually need.
regards
--
Tomas Vondra