From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Adding basic NUMA awareness |
Date: | 2025-07-08 12:27:12 |
Message-ID: | 44890599-03d0-43cd-9b7b-7b71ac351337@vondra.me |
Lists: | pgsql-hackers |
On 7/8/25 05:04, Andres Freund wrote:
> Hi,
>
> On 2025-07-04 13:05:05 +0200, Jakub Wartak wrote:
>> On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>> I do think the splitting would actually make some things simpler, or
>>> maybe more flexible - in particular, it'd allow us to enable huge pages
>>> only for some regions (like shared buffers), and keep the small pages
>>> e.g. for PGPROC. So that'd be good.
>>
>> You have made an assumption that this is good, but small pages (4KB) are
>> not hugetlb, and are *swappable* (Transparent HP are swappable too,
>> manually allocated ones as with mmap(MAP_HUGETLB) are not)[1]. The
>> most frequent problems I see these days are OOMs, and it makes me
>> believe that making certain critical parts of shared memory
>> swappable just to make pagesize granular is possibly throwing the baby
>> out with the bathwater. I'm thinking about bad situations like: some
>> wrong settings of vm.swappiness that people keep (or distros keep?) and
>> general inability of PG to refrain from allocating more memory in
>> some cases.
>
> The reason it would be advantageous to put something like the procarray onto
> smaller pages is that otherwise the entire procarray (unless particularly
> large) ends up on a single NUMA node, increasing the latency for backends on
> every other numa node and increasing memory traffic on that node.
>
That's why the patch series splits the procarray into multiple pieces,
so that it can be properly distributed across multiple NUMA nodes even with
huge pages. It requires adjusting a couple of places that access the entries,
but I was surprised by how limited the impact was.

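To illustrate the kind of adjustment involved, here is a minimal sketch of a
partitioned array with an accessor replacing direct indexing. This is not the
code from the patch series: DemoProc, demo_partitions, demo_init and
demo_get_proc are hypothetical names, the node count and partition size are
assumed constants, and plain static arrays stand in for memory regions that
would really be bound to individual NUMA nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PGPROC; the real struct is much larger. */
typedef struct DemoProc
{
    int pgprocno;   /* global index of this entry */
    int numa_node;  /* node whose partition holds this entry */
} DemoProc;

#define DEMO_NODES          4   /* assumed number of NUMA nodes */
#define DEMO_PROCS_PER_NODE 8   /* assumed number of entries per partition */

/*
 * One partition per node. In a real system each partition would be a
 * separately placed memory region; here they are ordinary static arrays.
 */
static DemoProc demo_partitions[DEMO_NODES][DEMO_PROCS_PER_NODE];

/* Fill in every entry with its global index and owning node. */
void
demo_init(void)
{
    for (int node = 0; node < DEMO_NODES; node++)
    {
        for (int i = 0; i < DEMO_PROCS_PER_NODE; i++)
        {
            demo_partitions[node][i].pgprocno = node * DEMO_PROCS_PER_NODE + i;
            demo_partitions[node][i].numa_node = node;
        }
    }
}

/*
 * Accessor used instead of indexing one flat array: map the global index
 * to (partition, offset within partition).
 */
DemoProc *
demo_get_proc(int pgprocno)
{
    assert(pgprocno >= 0 && pgprocno < DEMO_NODES * DEMO_PROCS_PER_NODE);

    return &demo_partitions[pgprocno / DEMO_PROCS_PER_NODE]
                           [pgprocno % DEMO_PROCS_PER_NODE];
}
```

Callers that used to index a single array would go through something like
demo_get_proc() instead, which is why only the places touching the entries
directly need to change.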
If we could selectively use 4KB pages for parts of the shared memory,
maybe this wouldn't be necessary. But the splitting is not too annoying.

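The page-size argument comes down to simple arithmetic, since memory is
placed on NUMA nodes one page at a time. A rough sketch follows; the helper
names and the sizes in the usage note are made up for illustration and are
not measurements of the actual procarray.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages needed to hold nbytes, rounding up. */
size_t
demo_pages_spanned(size_t nbytes, size_t page_size)
{
    assert(page_size > 0);

    return (nbytes + page_size - 1) / page_size;
}

/*
 * Upper bound on how many nodes a contiguous region can be spread over,
 * given that a single page cannot be split between nodes.
 */
size_t
demo_max_nodes_usable(size_t nbytes, size_t page_size, size_t nodes)
{
    size_t pages = demo_pages_spanned(nbytes, page_size);

    return pages < nodes ? pages : nodes;
}
```

For example, an assumed 1MB array fits in a single 2MB huge page, so
demo_max_nodes_usable(1024 * 1024, 2 * 1024 * 1024, 4) returns 1 and the
whole array sits on one node. With 4KB pages the same array spans 256
pages, so it could be spread over all 4 nodes without splitting it up.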
The thing I'm not sure about is how much this actually helps with the
traffic between nodes. Sure, if we pick a PGPROC from the same node, and
the task does not get moved, it'll be local traffic. But if the task
moves to a different node, there'll be cross-node traffic. I don't have
any estimates of how often this happens, e.g. for older tasks.

regards
--
Tomas Vondra