Re: Adding basic NUMA awareness

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Adding basic NUMA awareness
Date:	2025-07-08 12:27:12
Message-ID:	44890599-03d0-43cd-9b7b-7b71ac351337@vondra.me
Lists:	pgsql-hackers

On 7/8/25 05:04, Andres Freund wrote:
> Hi,
>
> On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
>> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>> I don't think the splitting would actually make some things simpler, or
>>> maybe more flexible - in particular, it'd allow us to enable huge pages
>>> only for some regions (like shared buffers), and keep the small pages
>>> e.g. for PGPROC. So that'd be good.
>>
>> You have made assumption that this is good, but small pages (4KB) are
>> not hugetlb, and are *swappable* (Transparent HP are swappable too,
>> manually allocated ones as with mmap(MAP_HUGETLB) are not)[1]. The
>> most frequent problem I see these days are OOMs, and it makes me
>> believe that making certain critical parts of shared memory being
>> swappable just to make pagesize granular is possibly throwing the baby
>> out with the bathwater. I'm thinking about bad situations like: some
>> wrong settings of vm.swappiness that people keep (or distros keep?) and
>> general inability of PG to restrain from allocating more memory in
>> some cases.
>
> The reason it would be advantageous to put something like the procarray onto
> smaller pages is that otherwise the entire procarray (unless particularly
> large) ends up on a single NUMA node, increasing the latency for backends on
> every other numa node and increasing memory traffic on that node.
>

That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed on multiple NUMA nodes even with
huge pages. It requires adjusting a couple of places accessing the entries,
but it surprised me how limited the impact was.
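
To illustrate the kind of splitting meant here, below is a minimal sketch in C. It is not the code from the patch series: the ProcPartition struct and the functions partition_procs() and proc_index_to_node() are hypothetical names made up for this example, and the even split is just an assumption. It only shows the index arithmetic (one contiguous chunk of entries per node, plus a lookup from a global index to the owning chunk). Actually placing each chunk's memory on its NUMA node is left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical descriptor for one per-node chunk of the proc array.
 * Not a PostgreSQL structure; invented for this sketch.
 */
typedef struct ProcPartition
{
    int first;  /* global index of the first entry in this chunk */
    int count;  /* number of entries in this chunk */
} ProcPartition;

/*
 * Split nprocs entries into nnodes contiguous chunks, as evenly as
 * possible. The first (nprocs % nnodes) chunks get one extra entry.
 */
static void
partition_procs(int nprocs, int nnodes, ProcPartition *parts)
{
    int base;
    int extra;
    int next = 0;

    assert(nprocs >= 0 && nnodes > 0 && parts != NULL);

    base = nprocs / nnodes;
    extra = nprocs % nnodes;

    for (int node = 0; node < nnodes; node++)
    {
        parts[node].first = next;
        parts[node].count = base + (node < extra ? 1 : 0);
        next += parts[node].count;
    }
}

/*
 * Map a global entry index to the chunk (node) that holds it, and store
 * the offset within that chunk in *offset. Returns -1 if the index is
 * out of range.
 */
static int
proc_index_to_node(const ProcPartition *parts, int nnodes,
                   int index, int *offset)
{
    if (index < 0)
        return -1;

    for (int node = 0; node < nnodes; node++)
    {
        if (index < parts[node].first + parts[node].count)
        {
            *offset = index - parts[node].first;
            return node;
        }
    }
    return -1;
}
```

With a layout like this, code that used to index one flat array goes through the lookup instead, which is the sort of adjustment to the places accessing the entries mentioned above.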

If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But it's not too annoying.

The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves to a different node, there'll be cross-node traffic. I don't have
any estimates of how often this happens, e.g. for older tasks.

regards

--
Tomas Vondra

In response to

Re: Adding basic NUMA awareness at 2025-07-08 03:04:48 from Andres Freund

Responses

Re: Adding basic NUMA awareness at 2025-07-08 12:56:06 from Andres Freund
