Re: Adding basic NUMA awareness

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-30 08:29:06
Message-ID: CAKZiRmx3+GwaP3oiRVHavxCJh6KxhZVZp86kj50ZJAv51h2-gQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:

Hi Tomas,

just a quick look here:

> 2) The PGPROC part introduces a similar registry, [..]
>
> There's also a view pg_buffercache_pgproc. The pg_buffercache location
> is a bit bogus - it has nothing to do with buffers, but it was good
> enough for now.

If you are looking for a better name: pg_shmem_pgproc_numa would sound
more natural.

> 3) The PGPROC partitioning is reworked and should fix the crash with the
> GUC set to "off".

Thanks!

> simple benchmark
> ----------------
[..]
> There are results for the three "pgbench pinning" strategies, and that
> can have a pretty significant impact (colocated generally performs much
> better than either "none" or "random").

Hint: in the real world, network cards usually sit in a PCI slot attached
to a certain NUMA node (so traffic flows from/to there), so it would
probably make sense to put pgbench outside this machine and remove this
"variable" anyway, and with it the need for that pgbench --pin-cpus in
the script. Under optimal conditions the most optimized layout would
probably be two cards in separate PCI slots, each attached to a different
node, with LACP between them and an xmit_hash_policy that distributes
traffic across both cards -- usually there is not just a single IP/MAC
out there talking to/from such a server, so that would be the real-world
affinity (or lack of it).
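
A quick way to check that kind of layout (a rough sketch; "ens1f0" is
just an example interface name, adjust to your setup):

$ cat /sys/class/net/ens1f0/device/numa_node   # NUMA node of the NIC's PCI slot, -1 = no affinity
$ grep ens1f0 /proc/interrupts                 # per-CPU counts for its IRQ lines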

Also, the classic pgbench workload seems to be a poor fit for testing
this out (at least for v3-0001 buffers); I would propose sticking to just
lots of big (~s_b-sized) full-table seq scans to put stress on shared
memory. By my measurements, classic pgbench usually does not put serious
bandwidth on the interconnect.
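
Something along these lines is what I have in mind (a rough sketch; the
table name, row count and client counts are just placeholders, to be
scaled to ~shared_buffers and the core count):

$ psql -c "create table numa_stress as select i, repeat('x', 200) as filler from generate_series(1, 20000000) i"
$ echo "select count(*) from numa_stress;" > seqscan.sql
$ pgbench -n -f seqscan.sql -c 64 -j 64 -T 300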

> For the "bigger" machine (with 176 cores) the incremental results look
> like this (for pinning=none, i.e. regular pgbench):
>
>
> mode      s_b    buffers  localal  no-tail  freelist  sweep  pgproc  pinning
> ============================================================================
> prepared  16GB   99%      101%     100%     103%      111%   99%     102%
>           32GB   98%      102%     99%      103%      107%   101%    112%
>           8GB    97%      102%     100%     102%      101%   101%    106%
> ----------------------------------------------------------------------------
> simple    16GB   100%     100%     99%      105%      108%   99%     108%
>           32GB   98%      101%     100%     103%      100%   101%    97%
>           8GB    100%     100%     101%     99%       100%   104%    104%
>
> The way I read this is that the first three patches have about no impact
> on throughput. Then freelist partitioning and (especially) clocksweep
> partitioning can help quite a bit. pgproc is again close to ~0%, and
> PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column representing just numa_procs_pin=on?
(Shouldn't it be tested with numa_procs_interleave = on as well?)

[..]
> To quantify this kind of improvement, I think we'll need tests that
> intentionally cause (or try to) imbalance. If you have ideas for such
> tests, let me know.

Some ideas (rough sketches below):
1. concurrent seq scans hitting an s_b-sized table
2. one single giant parallel (PX-enabled) seq scan with $VCPU workers
(stresses the importance of interleaving dynamic shm for the workers)
3. select txid_current() with -M prepared?
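
For (2) and (3), roughly (the worker/client counts are placeholders,
numa_stress is the table from the earlier sketch, and the parallel case
also needs max_worker_processes/max_parallel_workers raised in
postgresql.conf):

$ psql -c "set max_parallel_workers_per_gather = 176; explain analyze select count(*) from numa_stress"
$ echo "select txid_current();" > txid.sql
$ pgbench -n -M prepared -f txid.sql -c 176 -j 176 -T 60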

> reserving number of huge pages
> ------------------------------
[..]
> It took me ages to realize what's happening, but it's very simple. The
> nr_hugepages is a global limit, but it's also translated into limits for
> each NUMA node. So when you write 16828 to it, in a 4-node system each
> node gets 1/4 of that. See
>
> $ numastat -cm
>
> Then we do the mmap(), and everything looks great, because there really
> is enough huge pages and the system can allocate memory from any NUMA
> node it needs.

Yup, a similar story as with OOMs, just per-zone/per-node.
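
The per-node counters are also directly visible in sysfs, e.g. (node
number and huge page size depend on the box):

$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages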

> And then we come around, and do the numa_tonode_memory(). And that's
> where the issues start, because AFAIK this does not check the per-node
> limit of huge pages in any way. It just appears to work. And then later,
> when we finally touch the buffer, it tries to actually allocate the
> memory on the node, and realizes there's not enough huge pages. And
> triggers the SIGBUS.

I think that's why the options for a strict NUMA allocation policy exist,
and I had the option to use them in my patches (anyway, with one big call
to numa_interleave_memory() for everything it was much simpler and
avoided micromanaging things). numa(3) is a good read, but e.g. mbind(2)
underneath will tell you that `Before Linux 5.7, MPOL_MF_STRICT was
ignored on huge page mappings.` (I was on 6.14.x, but it could be
happening for you too if you start using it). Anyway, numa_set_strict()
is just a wrapper around setting this exact flag.

Anyway, remember that volatile pg_numa_touch_mem_if_required()? Maybe it
should always be called in your patch series to pre-populate everything
during startup, so that others testing this get the proper guaranteed
layout even without issuing any pg_buffercache calls.

> The only way around this I found is by inflating the number of huge
> pages, significantly above the shared_memory_size_in_huge_pages value.
> Just to make sure the nodes get enough huge pages.
>
> I don't know what to do about this. It's quite annoying. If we only used
> huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst case 1GB?) just to get
NUMA affinity seems like a logical trade-off, doesn't it?
But postgres -C shared_memory_size_in_huge_pages still works OK to
establish the exact count for vm.nr_hugepages, right?
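
i.e. something along these lines for provisioning (a sketch; the ~10%
headroom is arbitrary, just to absorb the per-node rounding you
described):

$ PAGES=$(postgres -D "$PGDATA" -C shared_memory_size_in_huge_pages)
$ sysctl -w vm.nr_hugepages=$(( PAGES + PAGES / 10 ))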

Regards,
-J.
