Re: Adding basic NUMA awareness

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-04 18:12:01
Message-ID: c6493e5b-76c6-45a5-8067-3a36f4df4fe7@vondra.me
Lists: pgsql-hackers

On 7/4/25 13:05, Jakub Wartak wrote:
> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hi!
>
>> 1) v1-0001-NUMA-interleaving-buffers.patch
> [..]
>> It's a bit more complicated, because the patch distributes both the
>> blocks and descriptors in the same way, so a buffer and its descriptor
>> always end up on the same NUMA node. This is one of the reasons why we
>> need to map larger chunks, because NUMA works on page granularity, and
>> the descriptors are tiny - many fit on a memory page.
>
> Oh, now I get it! OK, let's stick to this one.
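
To spell out the chunk-size arithmetic (a sketch with assumed
constants, not the actual patch code):

    #include <stddef.h>

    #define BLCKSZ     8192    /* buffer (block) size */
    #define DESC_SIZE  64      /* assumed padded descriptor size */

    /*
     * With 64-byte descriptors, one 4KB memory page holds 64 of them.
     * Grouping buffers into chunks of that many makes the descriptor
     * slice exactly one page and the buffer slice a whole number of
     * pages (64 x 8KB = 512KB), so a chunk of buffers and its
     * descriptors can be mapped to the same node at page granularity.
     */
    static size_t
    buffers_per_chunk(size_t mem_page_size)
    {
        return mem_page_size / DESC_SIZE;    /* 4096 / 64 = 64 */
    }
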
>
>> I think the splitting would actually make some things simpler, or
>> maybe more flexible - in particular, it'd allow us to enable huge pages
>> only for some regions (like shared buffers), and keep the small pages
>> e.g. for PGPROC. So that'd be good.
>
> You have made the assumption that this is good, but small pages (4KB)
> are not hugetlb and are *swappable* (Transparent HPs are swappable
> too; manually allocated ones, as with mmap(MAP_HUGETLB), are not)[1].
> The most frequent problem I see these days is OOMs, which makes me
> believe that making certain critical parts of shared memory swappable,
> just to get page-size granularity, is possibly throwing the baby out
> with the bathwater. I'm thinking about bad situations like: some wrong
> setting of vm.swappiness that people keep (or distros keep?), and PG's
> general inability to refrain from allocating more memory in some
> cases.
>

I haven't observed such issues myself, or maybe I just didn't realize
it was happening. Maybe it does happen, but it'd be good to see some
data showing that, or a reproducer of some sort. But let's say it's real.

I don't think we should use huge pages merely to ensure something is not
swapped out. The "not swappable" is more of a limitation of huge pages,
not an advantage. You can't just choose to make them swappable.

Wouldn't it be better to keep using 4KB pages, but lock the memory using
mlock/mlockall?
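
Something along these lines, early in shared memory setup (a sketch
only; the real call site and error handling need more thought, and it
requires CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK):

    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Sketch: pin a 4KB-page shared memory region so it cannot be
     * swapped out, without resorting to hugetlb. Hypothetical helper,
     * not code from any of the patches.
     */
    static int
    pin_shmem_region(void *base, size_t size)
    {
        /* mlock() faults the pages in and keeps them resident */
        return mlock(base, size);
    }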

>> The other thing I haven't thought about very much is determining on
>> which CPUs/nodes the instance is allowed to run. I assume we'd start by
>> simply inheriting/determining that at startup through libnuma, not
>> through some custom PG configuration (which the patch [2] proposed to do).
>
> 0. I think that we could do better; some counterarguments to
> no-configuration-at-all:
>
> a. as Robert & Bertrand already pointed out in review: let's say I
> want to run just on NUMA node #2, so here I would need to override the
> systemd unit's ExecStart= to include that numactl (not elegant?). I
> could also use `CPUAffinity=1,3,5,7..` but that's all, and it is even
> less friendly. It also probably requires root to edit/reload systemd,
> while having a GUC for this, as in my proposal, makes it smoother (I
> think?)
>
> b. wouldn't it be better if this stayed opt-in rather than always on?
> What if there's a problem - how do you disable those internal
> optimizations if they do harm in some cases? (or let's say I want to
> play with MPOL_INTERLEAVE_WEIGHTED?). So at least a boolean
> numa_buffers_interleave GUC would be nice?
>
> c. What if I want my standby (walreceiver+startup/recovery) to run
> with NUMA affinity to get better performance? (I'm not going to hack
> around the systemd script every time, but I could imagine changing
> numa=X,Y,Z after restart/before promotion.)
>
> d. Now, if I were forced for some reason to do that numactl(1) voodoo,
> use those above-mentioned overrides, and PG didn't have a GUC (let's
> say I used `numactl --weighted-interleave=0,1`), then:
>
>

I'm not against doing something like this, but I don't plan to do it in
V1. I don't have a clear idea what configurability is actually needed,
so it's likely I'd get the interface wrong.
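
If we do end up with a GUC, a boolean like numa_buffers_interleave
would presumably be just a ConfigureNamesBool entry in guc_tables.c,
roughly like this (a sketch; name and wording are placeholders):

    /* hypothetical entry in ConfigureNamesBool[], guc_tables.c */
    {
        {"numa_buffers_interleave", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Interleaves shared buffers across NUMA nodes."),
            NULL
        },
        &numa_buffers_interleave,
        false,
        NULL, NULL, NULL
    },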

>> 2) v1-0002-NUMA-localalloc.patch
>> This simply sets "localalloc" when initializing a backend, so that all
>> memory allocated later is local, not interleaved. Initially this was
>> necessary because the patch set the allocation policy to interleaving
>> before initializing shared memory, and we didn't want to interleave the
>> private memory. But that's no longer the case - the explicit mapping to
>> nodes does not have this issue. I'm keeping the patch for convenience;
>> it allows experimenting with numactl etc.
>
> ... is not accurate anymore, and we would still be required to have
> that in (still with a GUC)?
> Thoughts? I can add that part of mine into your patches if you want.
>

I'm sorry, I don't understand what the question is :-(
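
If the question is whether we'd still want the explicit localalloc
call: what the patch does boils down to roughly this, once, early in
backend startup (a sketch using the plain libnuma calls; the helper
name is made up):

    #include <numa.h>

    /*
     * Make all later allocations in this process node-local,
     * overriding any interleave policy inherited from the postmaster
     * (e.g. when the server is started under numactl --interleave).
     */
    static void
    set_backend_local_alloc(void)
    {
        if (numa_available() != -1)
            numa_set_localalloc();
    }

With the explicit mapping to nodes the postmaster no longer sets an
interleave policy itself, but this could still matter when the whole
server is started under numactl.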

> A way-too-quick review and some very fast benchmark probes; I've
> concentrated only on v1-0001 and v1-0005 (the efficiency of buffermgmt
> would be too new a topic for me), but let's start:
>
> 1. normal pgbench -S (still with just s_b(at)4GB), many tries done, a
> consistent benefit for the patch with roughly a +8..10% boost on a
> generic run:
>
> numa_buffers_interleave=off numa_pgproc_interleave=on (due to that
> always-on "if"), s_b just on 1 NUMA node (might happen)
> latency average = 0.373 ms
> latency stddev = 0.237 ms
> initial connection time = 45.899 ms
> tps = 160242.147877 (without initial connection time)
>
> numa_buffers_interleave=on numa_pgproc_interleave=on
> latency average = 0.345 ms
> latency stddev = 0.373 ms
> initial connection time = 44.485 ms
> tps = 177564.686094 (without initial connection time)
>
> 2. Tested it the same way as I did for mine (problem #2 from Andres's
> presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
> seqconcurrscans.pgb as earlier)
> default/numa_buffers_interleave=off
> latency average = 1375.478 ms
> latency stddev = 1141.423 ms
> initial connection time = 46.104 ms
> tps = 45.868075 (without initial connection time)
>
> numa_buffers_interleave=on
> latency average = 838.128 ms
> latency stddev = 498.787 ms
> initial connection time = 43.437 ms
> tps = 75.413894 (without initial connection time)
>
> and I've repeated the same test (identical conditions) with my patch,
> which got me slightly more juice:
> latency average = 727.717 ms
> latency stddev = 410.767 ms
> initial connection time = 45.119 ms
> tps = 86.844161 (without initial connection time)
>
> (but mine didn't get that boost from the normal pgbench -S as per #1
> -- my numa='all' stays @ 160k TPS, just like
> numa_buffers_interleave=off), so this idea is clearly better.

Good, thanks for the testing. I should have done something like this
when I posted my patches, but I forgot about that (and the email felt
too long anyway).

But this actually brings up an interesting question: what exactly
should we expect / demand from these patches? In my mind it's primarily
about predictability and stability of results.

For example, the results should not depend on how the database was
warmed up - was it done by a single backend or by many backends? Was it
restarted, or what? I could probably warm up the system very carefully
to ensure it's balanced. The patches mean I don't need to be that careful.

> So should I close https://commitfest.postgresql.org/patch/5703/
> and you'll open a new one, or should I just edit #5703 and alter it
> and add this thread too?
>

Good question. It's probably best to close the original entry as
"withdrawn" and I'll add a new entry. Sounds OK?

> 3. The patch is not calling interleave on PQ shmem; do we want to add
> that as some next item, like v1-0007? The question is whether OS
> interleaving makes sense there. I believe it does; please see my
> thread (NUMA_pq_cpu_pinning_results.txt). The issue is that PQ workers
> are spawned by the postmaster and may end up on different NUMA nodes
> randomly, so OS-interleaving that memory actually reduces jitter there
> (AKA bandwidth-over-latency). My thinking is that one cannot expect a
> static/forced CPU-to-just-one-NUMA-node assignment for a backend and
> its PQ workers, because it is impossible to always have CPU power
> available in that one NUMA node, so it might be useful to interleave
> that shared mem there too (as a separate patch item?)
>

Excellent question. I haven't thought about this at all. I agree it
probably makes sense to interleave this memory, in some way. I don't
know what the perfect scheme is, though.

Wild idea: would it make sense to pin the workers to the same NUMA node
as the leader, and allocate all memory only from that node?
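
Something like this in the worker startup path, perhaps (a sketch;
"leader_node" would have to travel in the worker startup data somehow,
which is exactly the part I haven't thought through):

    #include <numa.h>

    /*
     * Sketch: bind a PQ worker to the leader's NUMA node and prefer
     * allocations from that node. Hypothetical helper, not in the
     * patches.
     */
    static void
    bind_worker_to_leader_node(int leader_node)
    {
        if (numa_available() == -1)
            return;
        numa_run_on_node(leader_node);    /* run on that node's CPUs */
        numa_set_preferred(leader_node);  /* and allocate there */
    }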

> 4. In BufferManagerShmemInit() you call numa_num_configured_nodes()
> (also in v1-0005). My worry is: should we perhaps put in some
> known-limitations docs (?) from the start and mention that
> if the VM is greatly resized and new NUMA nodes appear, they might
> not be used until restart?
>

Yes, this is one thing I need some feedback on. The patches mostly
assume there are no disabled nodes, that the set of allowed nodes does
not change, etc. I think for V1 that's a reasonable limitation.

But let's say we want to relax this a bit. How do we learn about the
change after a node/CPU gets disabled? For some parts it's not that
difficult (e.g. we can "remap" buffers/descriptors in the background),
but for other parts that's not practical - e.g. we can't rework how the
PGPROC gets split.

But while discussing this with Andres yesterday, he had an interesting
suggestion - always use e.g. 8 or 16 partitions, and then map those
partitions to NUMA nodes. So we'd have 16 partitions, and with 4 nodes,
partitions 0-3 would go to node 0, 4-7 to node 1, etc. The advantage is
that if a node gets disabled, we can rebuild just this small "mapping"
and not the 16 partitions. And the partitioning may be helpful even
without NUMA.

I still have to figure out the details, but it seems it might help.
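
Roughly this kind of mapping, I think (a sketch; the names are made
up):

    #define NUM_PARTITIONS 16    /* fixed, independent of node count */

    /* partition -> NUMA node; cheap to rebuild if nodes change */
    static int part_to_node[NUM_PARTITIONS];

    static void
    rebuild_partition_map(int num_nodes)
    {
        /* assumes num_nodes divides NUM_PARTITIONS evenly */
        int per_node = NUM_PARTITIONS / num_nodes;

        for (int i = 0; i < NUM_PARTITIONS; i++)
            part_to_node[i] = i / per_node;  /* 4 nodes: 0-3 -> node 0 */
    }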

> 5. In v1-0001, pg_numa_interleave_memory()
>
> + * XXX no return value, to make this fail on error, has to use
> + * numa_set_strict
>
> Yes, my patch has those numa_error() and numa_warn() handlers too, in
> pg_numa. Feel free to use them for better UX.
>
> + * XXX Should we still touch the memory first, like
> with numa_move_pages,
> + * or is that not necessary?
>
> It's not necessary to touch the memory after numa_tonode_memory() (or
> numa_interleave_memory()); if it is going to be used anyway, it will
> be placed correctly, to the best of my knowledge.
>
> 6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
>
> Accidental indents (also fails to apply)
>
> 7. We're missing the pg_numa_* shims, but for sure that's for later;
> they would also let us avoid those Linux-specific #ifdef USE_LIBNUMA
> blocks and so on?
>

Right, we need to add those. Or actually, we need to think about how
we'd do this for non-NUMA systems. I wonder if we'd even want to just
build everything the "old way" (without the partitions, etc.).

But per the earlier comment, the partitioning seems beneficial even on
non-NUMA systems, so maybe the shims are good enough.
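
I'd expect the shims to follow the usual pattern, something like this
(a sketch; the routine names are placeholders, and the numa_error()
override is the libnuma hook you mentioned):

    #ifdef USE_LIBNUMA
    #include <numa.h>

    int
    pg_numa_init(void)
    {
        return numa_available();    /* -1 if NUMA is unsupported */
    }

    int
    pg_numa_get_num_nodes(void)
    {
        return numa_num_configured_nodes();
    }

    /* libnuma reports problems through this overridable hook */
    void
    numa_error(char *where)
    {
        elog(WARNING, "libnuma error in %s", where);
    }

    #else

    int
    pg_numa_init(void)
    {
        return -1;                  /* behave as if NUMA were absent */
    }

    int
    pg_numa_get_num_nodes(void)
    {
        return 1;                   /* single-node system */
    }
    #endif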

> 8. v1-0005 has 2x + /* if (numa_procs_interleave) */
>
> Ha, it's a TRAP! I've uncommented it because I wanted to try it out
> without it (just by setting the GUC off), but "MyProc->sem" is NULL:
>
> 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL
> 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
> [..]
> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755)
> was terminated by signal 11: Segmentation fault
> 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other
> active server processes
> 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because
> "restart_after_crash" is off
> 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
>
> [New LWP 28755]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: io worker '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 __new_sem_wait_fast (definitive_result=1, sem=sem(at)entry=0x0)
> at ./nptl/sem_waitcommon.c:136
> 136 ./nptl/sem_waitcommon.c: No such file or directory.
> (gdb) where
> #0 __new_sem_wait_fast (definitive_result=1, sem=sem(at)entry=0x0)
> at ./nptl/sem_waitcommon.c:136
> #1 __new_sem_trywait (sem=sem(at)entry=0x0) at ./nptl/sem_wait.c:81
> #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
> ../src/backend/port/posix_sema.c:302
> #3 0x0000556191970553 in InitAuxiliaryProcess () at
> ../src/backend/storage/lmgr/proc.c:992
> #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
> ../src/backend/postmaster/auxprocess.c:65
> #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized
> out>, startup_data_len=<optimized out>) at
> ../src/backend/storage/aio/method_worker.c:393
> #6 0x00005561918e8163 in postmaster_child_launch
> (child_type=child_type(at)entry=B_IO_WORKER, child_slot=20086,
> startup_data=startup_data(at)entry=0x0,
> startup_data_len=startup_data_len(at)entry=0,
> client_sock=client_sock(at)entry=0x0) at
> ../src/backend/postmaster/launch_backend.c:290
> #7 0x00005561918ea09a in StartChildProcess
> (type=type(at)entry=B_IO_WORKER) at
> ../src/backend/postmaster/postmaster.c:3973
> #8 0x00005561918ea308 in maybe_adjust_io_workers () at
> ../src/backend/postmaster/postmaster.c:4404
> [..]
> (gdb) print *MyProc->sem
> Cannot access memory at address 0x0
>

Yeah, good catch. I'll look into that next week.

> 9. v1-0006: is this just a thought or a serious candidate? I can
> imagine it easily blowing up, with some backends somehow requesting
> CPUs from only one NUMA node while the second node sits idle. Isn't it
> better just to leave CPU scheduling, well, to the CPU scheduler? The
> problem is that you have tools showing overall CPU usage, even
> mpstat(1) per CPU, but no tools for per-NUMA-node CPU util%, so it
> would be hard for someone to realize that this is happening.
>

Mostly experimental, for benchmarking etc. I agree we may not want to
mess with the task scheduling too much.

Thanks for the feedback!

regards

--
Tomas Vondra
