Re: Adding basic NUMA awareness

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-09 10:04:00
Message-ID: CAKZiRmy4EGAGvHjEEEwqm8m_su_xtW5ZLHLLZJQkU-ier=fqrQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 8, 2025 at 2:56 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2025-07-08 14:27:12 +0200, Tomas Vondra wrote:
> > On 7/8/25 05:04, Andres Freund wrote:
> > > On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
> > > The reason it would be advantageous to put something like the procarray onto
> > > smaller pages is that otherwise the entire procarray (unless particularly
> > > large) ends up on a single NUMA node, increasing the latency for backends on
> > > every other numa node and increasing memory traffic on that node.
> > >

Sure thing, I fully understand the motivation and the underlying
reason (without claiming that I understand the exact memory access
patterns and hotspots involving procarray/PGPROC/etc. on the PG
side). Any one-liner pgbench suggestion for how to really easily
stress PGPROC or the procarray?

> > That's why the patch series splits the procarray into multiple pieces,
> > so that it can be properly distributed on multiple NUMA nodes even with
> > huge pages. It requires adjusting a couple places accessing the entries,
> > but it surprised me how limited the impact was.

Yes, and we are discussing whether it is worth getting into smaller
pages for such use cases (e.g. 4kB pages without hugetlb rather than
2MB huge pages; or, even more wasteful, 1GB hugetlb if we don't
request 2MB for some small structs; btw, we have the ability to
select MAP_HUGE_2MB vs MAP_HUGE_1GB). I'm thinking about two
problems:
- 4kB pages are swappable, and mlock() potentially (?) disarms NUMA
autobalancing
- using libnuma often leads to MPOL_BIND, which disarms NUMA
autobalancing, BUT apparently there are set_mempolicy(2)/mbind(2),
and since kernel 5.12+ they can take an additional flag,
MPOL_F_NUMA_BALANCING(!), so this looks like it has the potential to
move memory anyway (if way too many tasks are relocated, the memory
would follow?). It is available only in recent libnuma as
numa_set_membind_balancing(3), but sadly there seems to be no way via
libnuma to do mbind(MPOL_F_NUMA_BALANCING) for a specific addr only.
I mean, it would have to be something like MPOL_F_NUMA_BALANCING |
MPOL_PREFERRED (select one preferred node out of many for each chunk
while still allowing balancing?), but in [1][2] (2024) it is stated
that "It's not legitimate (yet) to use MPOL_PREFERRED +
MPOL_F_NUMA_BALANCING.", though maybe things have improved since
then.

Something like:
PGPROC/procarray 2MB page for node #1: mbind(addr1, MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [0,1]);
PGPROC/procarray 2MB page for node #2: mbind(addr2, MPOL_F_NUMA_BALANCING | MPOL_PREFERRED, [1,0]);
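
To be concrete, a minimal sketch of what such a per-chunk call could
look like (my speculation, nothing from the patch series; per [1][2]
current kernels reject this flag combination, so expect EINVAL, it
only shows the shape of the call):

#include <stddef.h>
#include <stdio.h>
#include <numaif.h>             /* mbind(); link with -lnuma */

#ifndef MPOL_F_NUMA_BALANCING
#define MPOL_F_NUMA_BALANCING (1 << 13) /* linux/mempolicy.h, 5.12+ */
#endif

/* Hypothetical: prefer one node for this chunk, keep autobalancing on. */
static int
bind_chunk_balanced(void *addr, size_t len, int preferred_node)
{
    unsigned long nodemask = 1UL << preferred_node;

    if (mbind(addr, len, MPOL_PREFERRED | MPOL_F_NUMA_BALANCING,
              &nodemask, sizeof(nodemask) * 8, 0) != 0)
    {
        perror("mbind");        /* EINVAL expected on today's kernels */
        return -1;
    }
    return 0;
}

numa_set_membind_balancing(3) in recent libnuma does the MPOL_BIND |
MPOL_F_NUMA_BALANCING equivalent, but only process-wide via
set_mempolicy(2), not per address range, which is exactly the gap
above.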

> Sure, you can do that, but it does mean that iterations over the procarray now
> have an added level of indirection...

So the most efficient would be the old way (no indirection), and the
NUMA way costs that extra level of indirection? Can this be done
without #ifdefs at all?

> > The thing I'm not sure about is how much this actually helps with the
> > traffic between node. Sure, if we pick a PGPROC from the same node, and
> > the task does not get moved, it'll be local traffic. But if the task
> > moves, there'll be traffic.

With MPOL_F_NUMA_BALANCING, that should "auto-tune" in the worst case?

> > I don't have any estimates how often this happens, e.g. for older tasks.

We could measure: kernel 6.16+ has a per-PID numa_task_migrated
counter in /proc/{PID}/sched, but I assume we would have to throw
backends >> VCPUs at it to simulate reality, and do some "waves"
between different activity periods of certain pools. I can imagine a
worst-case scenario (a sketch for reading the counter follows the
list):
a) pgbench "a" opens $VCPU connections, all idle, with a script that
sleeps for a while
b) pgbench "b" opens some $VCPU new connections to some other DB, all
active from the start (tpcb-like or read-only)
c) manually pin each PID from "b" to a specific NUMA node #2 using
taskset -- just to simulate an unfortunate app working on every 2nd
conn
d) pgbench "a" starts working and hits the CPU imbalance -- e.g. NUMA
node #1 is idle, #2 is full, the CPU scheduler starts putting "a"
backends on CPUs from node #1, and we should notice the PIDs being
migrated
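
For d), an untested sketch to pull that counter per backend (field
name as per the 6.16+ /proc/{PID}/sched mentioned above, so treat it
as a guess until verified):

#include <stdio.h>
#include <sys/types.h>

/* Returns numa_task_migrated for a PID, or -1 if absent/unreadable. */
static long
numa_task_migrations(pid_t pid)
{
    char path[64];
    char line[256];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/sched", (int) pid);
    if ((f = fopen(path, "r")) == NULL)
        return -1;
    while (fgets(line, sizeof(line), f))
    {
        if (sscanf(line, "numa_task_migrated : %ld", &val) == 1)
            break;
    }
    fclose(f);
    return val;
}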

> I think the most important bit is to not put everything onto one numa node,
> otherwise the chance of increased latency for *everyone* due to the increased
> memory contention is more likely to hurt.

-J.

p.s. I hope I wrote this in an understandable way, as I had many
interruptions, so if anything is unclear please let me know.

[1] - https://lkml.org/lkml/2024/7/3/352
[2] - https://lkml.rescloud.iu.edu/2402.2/03227.html
