Re: Draft for basic NUMA observability

From: Patrick Stählin <me(at)packi(dot)ch>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: Draft for basic NUMA observability
Date: 2025-07-25 18:06:39
Message-ID: 3d8bccef-1395-40ec-bc3d-cccd1882227a@packi.ch
Lists: pgsql-hackers

Hi Jakub

On 7/24/25 10:01 AM, Jakub Wartak wrote:
> On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <me(at)packi(dot)ch> wrote:
>>
>> Hi!
>>
>> On 4/7/25 11:27 PM, Tomas Vondra wrote:
>>>
>>> I've pushed all three parts of v29, with some additional corrections
>>> (picked lower OIDs, bumped catversion, fixed commit messages).
>>
>> While building the PG18 beta1/2 packages I noticed that in our build
>> containers the selftest for pg_buffercache_numa and numa failed. Although
>> libnuma was available and pg_numa_init/numa_available returned no
>> errors, we still fail in pg_numa_query_pages/move_pages with EPERM,
>> yielding the following error when accessing
>> pg_buffercache_numa/pg_shmem_allocations_numa:
>>
>> ERROR: failed NUMA pages inquiry: Operation not permitted
>>
>> The man-page of move_pages led me to believe that this is because of
>> the missing capability CAP_SYS_NICE on the process but I couldn't prove
>> that theory with the attached patch.
>> The patch did make the tests pass but also disabled NUMA permanently on
>> a vanilla Debian VM and that is certainly not wanted. It may well be
>> that my understanding of checking capabilities and how they work is
>> incomplete. I also think that adding a new dependency for the reason of
>> just checking the capability is probably a bit of an overkill, maybe we
>> can check if we can access move_pages once without an error before
>> treating it as one?
>>
>> I'd be happy to debug this further but I have limited access to our
>> build-infra, I should be able to sneak in commands during the build though.
>
>
> Hi Patrick,
>
> So is it because the container was started without CAP_SYS_NICE so
> even root -> postgres is not having this cap? In my book container
> would be rather small and certainly single container wouldn't be
> spanning multiple CPU sockets, so I would just disable libnuma, anyway
> if I do on regular VM:
> [...]

This is only the build environment, but it runs the selftest, which then
fails. The containers we run in production are a totally different setup,
and there the NUMA calls actually work. Disabling libnuma may be an option,
but it would be nice to detect at runtime that we can't access it.
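
To make the "probe once" idea concrete, here is a rough sketch. It is
written in Python purely for illustration (PostgreSQL itself would do this
in C), it is not how pg_numa works today, and the function name, the
returned strings and the syscall-number table are all my own assumptions.
It asks move_pages for the node of one page of its own memory and treats
EPERM or ENOSYS as "NUMA not usable here" instead of as a hard error.

```python
import ctypes
import errno
import mmap
import platform
import sys

# Assumption: raw syscall numbers for move_pages(2). Only these two
# architectures are covered; anything else is reported as "unsupported".
_SYS_MOVE_PAGES = {"x86_64": 279, "aarch64": 239}


def probe_move_pages() -> str:
    """Hypothetical helper: return "ok", "unavailable", "unsupported" or "error"."""
    nr = _SYS_MOVE_PAGES.get(platform.machine())
    if not sys.platform.startswith("linux") or nr is None:
        return "unsupported"

    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long

    # Get one page-aligned address inside a buffer we own and touch it
    # so that the page is actually backed by memory.
    page = mmap.PAGESIZE
    buf = ctypes.create_string_buffer(2 * page)
    addr = (ctypes.addressof(buf) + page - 1) & ~(page - 1)
    ctypes.memset(addr, 0, 1)

    pages = (ctypes.c_void_p * 1)(addr)
    status = (ctypes.c_int * 1)()

    # pid 0 = calling process, nodes = NULL = query only, nothing is moved.
    ctypes.set_errno(0)
    rc = libc.syscall(ctypes.c_long(nr), ctypes.c_int(0), ctypes.c_ulong(1),
                      pages, None, status, ctypes.c_int(0))
    if rc == 0:
        return "ok"
    if ctypes.get_errno() in (errno.EPERM, errno.ENOSYS):
        return "unavailable"
    return "error"


if __name__ == "__main__":
    print(probe_move_pages())
```

A caller could run this once at startup and only report later EPERM
results as errors if the initial probe returned "ok".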

> Can you provide exact details about this container technology?

We use podman to set everything up.

> Can you provide /usr/sbin/capsh --print just before starting PG there?
> Maybe this is more cgroup/cpuset somehow related too?

Here is the output, it seems that cap_sys_nice is missing from the
bounding set:

+ /usr/sbin/capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=2000(buildkite-agent) euid=2000(buildkite-agent)
gid=2000(buildkite-agent)
groups=2000(buildkite-agent)
Guessed mode: HYBRID (4)
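
For anyone who wants to check this mechanically in a build script, a small
sketch that pulls the bounding set out of `capsh --print` output and tests
for cap_sys_nice. The function name and the SAMPLE constant are made up for
this example; the sample text is copied from the output above. The pattern
tolerates the value being wrapped onto the line after "Bounding set".

```python
import re

# Excerpt of the capsh output shown above (bounding-set part only).
SAMPLE = """Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =
"""


def bounding_caps(capsh_output: str) -> set:
    """Hypothetical helper: return the capability names in the bounding set."""
    # \s* before '=' allows a line break after "Bounding set"; after '='
    # only spaces/tabs are skipped so an empty set stays empty.
    match = re.search(r"Bounding set\s*=[ \t]*([\w,]*)", capsh_output)
    if match is None:
        raise ValueError("no 'Bounding set' entry found")
    return {cap for cap in match.group(1).split(",") if cap}


if __name__ == "__main__":
    caps = bounding_caps(SAMPLE)
    print("cap_sys_nice in bounding set:", "cap_sys_nice" in caps)
```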

> Anyway, there is a simpler way to make the tests pass if that's what
> you are after. We do have
> contrib/pg_buffercache/sql/pg_buffercache_numa.sql which is expected
> to match outputs in pg_buffercache_numa.out OR (!)
> pg_buffercache_numa_1.out. We could just handle this edge case by
> adding pg_buffercache_numa_2.out too probably (which would just
> contain semi-valid scenario for "ERROR: failed NUMA pages inquiry:
> Operation not permitted")

Ah, I didn't know that was a possibility. Until the NUMA functions see more
usage than just querying the state, this may be a nice workaround. If the
problem turns out to be more widespread, we will probably need more robust
detection. I already patch out the tests in our build environment, so for
me it's "solved", but that is certainly not a proper solution.
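
For readers not familiar with the alternate-expected-file mechanism Jakub
describes, a simplified sketch of the idea: a test passes if its actual
output equals any one of the accepted expected outputs. This is not
pg_regress itself (which is written in C and compares files with diff); the
function names are invented, and the _0 to _9 suffix range reflects my
understanding of pg_regress rather than anything stated in this thread.

```python
def expected_candidates(test_name: str) -> list:
    """File names that may hold an accepted output for one test (assumed scheme)."""
    return [f"{test_name}.out"] + [f"{test_name}_{i}.out" for i in range(10)]


def matches_any_expected(actual: str, expected_variants: list) -> bool:
    """True if the actual output is identical to at least one accepted variant."""
    return any(actual == expected for expected in expected_variants)


if __name__ == "__main__":
    # Placeholder contents, not the real expected files.
    variants = [
        "output when NUMA is available\n",
        "output when NUMA is not supported\n",
        "ERROR: failed NUMA pages inquiry: Operation not permitted\n",
    ]
    actual = "ERROR: failed NUMA pages inquiry: Operation not permitted\n"
    print(matches_any_expected(actual, variants))
    print(expected_candidates("pg_buffercache_numa")[:3])
```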

Just FYI, I'll be on PTO, so I won't have access to the build environment
for the next two weeks.

Thanks,
Patrick
