Re: failed NUMA pages inquiry status: Operation not permitted

From: Christoph Berg <myon(at)debian(dot)org>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: failed NUMA pages inquiry status: Operation not permitted
Date: 2025-12-16 13:16:30
Message-ID: aUFbrmKrYPBuTZ1c@msg.df7cb.de
Lists: pgsql-committers pgsql-hackers

Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
> -ENOENT
> The page is not present.
>
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon(at)postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon(at)postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.postgresql.org amd64 build machine while a few
other builds were running. Maybe ENOENT "The page is not present" means
something had just been swapped out because the machine was under
heavy load.
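
Swapping is only one way for a page to be "not present"; a page that was mapped but never touched should report the same status. A minimal sketch of such a status query, assuming Linux with NUMA support and calling the move_pages(2) syscall directly so that libnuma is not required. The helper name query_two_pages is hypothetical and not taken from PostgreSQL:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: map two anonymous pages, touch only the first,
 * then ask the kernel where both reside. Passing nodes == NULL makes
 * move_pages(2) report per-page status instead of moving anything.
 * Returns 0 on success, -1 (with errno set) if the query itself fails,
 * e.g. EPERM under a restrictive seccomp profile.
 */
int
query_two_pages(int status[2])
{
	long		pagesize = sysconf(_SC_PAGESIZE);
	char	   *mem;
	void	   *pages[2];
	long		rc;
	int			saved_errno;

	mem = mmap(NULL, 2 * pagesize, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	mem[0] = 1;					/* fault in the first page only */

	pages[0] = mem;
	pages[1] = mem + pagesize;

	/* pid 0 means the calling process */
	rc = syscall(SYS_move_pages, 0, 2UL, pages, (const int *) NULL,
				 status, 0);
	saved_errno = errno;

	munmap(mem, 2 * pagesize);
	errno = saved_errno;
	return rc == 0 ? 0 : -1;
}
```

When the call succeeds, status[0] should hold a node number (0 or higher) and status[1] should hold -ENOENT, because the second page was never faulted in.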

I tried reading the kernel source and it sounds related:

* If the source virtual memory range has any unmapped holes, or if
* the destination virtual memory range is not a whole unmapped hole,
* move_pages() will fail respectively with -ENOENT or -EEXIST. This
* provides a very strict behavior to avoid any chance of memory
* corruption going unnoticed if there are userland race conditions.
* Only one thread should resolve the userland page fault at any given
* time for any given faulting address. This means that if two threads
* try to both call move_pages() on the same destination address at the
* same time, the second thread will get an explicit error from this
* command.
...
* The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
* prevent -ENOENT errors to materialize if there are holes in the
* source virtual range that is being remapped. The holes will be
* accounted as successfully remapped in the retval of the
* command. This is mostly useful to remap hugepage naturally aligned
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
		   unsigned long src_start, unsigned long len, __u64 mode)
...
	if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
		err = -ENOENT;
		break;

What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):

int numa_move_pages(int pid, unsigned long count,
	void **pages, const int *nodes, int *status, int flags)
{
	return move_pages(pid, count, pages, nodes, status, flags);
}

I guess the answer is somewhere in that gap.

> ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)
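
The "negative numbers to NULL" idea could look roughly like the sketch below. This is an illustration only, not a proposed patch: the NodeOrNull type and the status_to_node function are hypothetical names, and the only assumption about the kernel interface is the documented one that negative entries in the move_pages(2) status array are negated errno codes.

```c
#include <stdbool.h>

/* Hypothetical result type: node is meaningful only when is_null is false. */
typedef struct NodeOrNull
{
	bool		is_null;
	int			node;
} NodeOrNull;

/*
 * Negative values in the move_pages(2) status array are negated errno
 * codes (e.g. -ENOENT), not node ids, so report them as NULL rather
 * than raising an error. Non-negative values pass through unchanged.
 */
NodeOrNull
status_to_node(int status)
{
	NodeOrNull	result;

	result.is_null = (status < 0);
	result.node = result.is_null ? -1 : status;
	return result;
}
```

With this mapping, the -2 from the log above would surface as NULL in the view instead of aborting the query, while valid node ids are passed through as they are.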

Christoph
