From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-08-12 14:24:15
Message-ID: pkboeixpuptpkv56tebehlfer6htcqcxistzkav7xa2hrmwz6c@dfmt524owiyq
Lists: pgsql-hackers
Hi,
On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote:
> Right. I don't think the current patch would crash - I can't test it,
> but I don't see why it would crash. In the worst case it'd end up with
> partitions that are not ideal. The question is more what would an ideal
> partitioning for buffers and PGPROC look like. Any opinions?
>
> For PGPROC, it's simple - it doesn't make sense to allocate partitions
> for nodes without CPUs.
>
> For buffers, it probably does not really matter if a node does not have
> any CPUs. If a node does not have any CPUs, that does not mean we should
> not put any buffers on it. After all, CXL will never have any CPUs (at
> least I think that's the case), and not using it for shared buffers
> would be a bit strange. Although, it could still be used for page cache.
For CXL memory to be really usable, I think we'd need nontrivial additional
work. CXL memory has considerably higher latency and lower throughput. You'd
*never* want things like BufferDescs on such nodes. And even for the
buffered data itself, you'd want to make sure that frequently used data,
e.g. inner index pages, never ends up on it.
Which leads to:
> Maybe it should be "tiered" a bit more?
Yes, for proper CXL support, we'd need a component that explicitly demotes
pages from "real" memory to CXL memory and promotes them back. The demotion
is relatively easy: you'd probably just do it whenever you'd otherwise throw
out a victim buffer. When to promote back is harder...
> The patch differentiates only between partitions on "my" NUMA node vs. every
> other partition. Maybe it should have more layers?
Given the relative unavailability of CXL memory systems, I think just not
crashing is good enough for now...
> >> I'm not sure what to do about this (or how getcpu() or libnuma handle this).
> >
> > I don't immediately see any libnuma functions that would care?
> >
>
> Not sure what "care" means here. I don't think it's necessarily broken,
> it's more about the APIs not making the situation very clear (or
> convenient).
What I mean is that I was looking through the libnuma functions and didn't see
any that would be affected by having multiple "local" NUMA nodes. But:
> How do you determine nodes for a CPU, for example? The closest thing I
> see is numa_node_of_cpu(), but that only returns a single node. Or how
> would you determine the number of nodes with CPUs (so that we create
> PGPROC partitions only for those)? I suppose that requires literally
> walking all the nodes.
I didn't think of numa_node_of_cpu().
As long as numa_node_of_cpu() returns *something* I think it may be good
enough. Nobody uses an RPi for high-throughput postgres workloads with a lot
of memory. Slightly sub-optimal mappings should really not matter.
I'm kinda wondering if we should deal with such fake numa systems by detecting
them and disabling our numa support.
Greetings,
Andres Freund