| From: | Christoph Berg <myon(at)debian(dot)org> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: failed NUMA pages inquiry status: Operation not permitted |
| Date: | 2025-12-16 13:16:30 |
| Message-ID: | aUFbrmKrYPBuTZ1c@msg.df7cb.de |
| Lists: | pgsql-committers pgsql-hackers |
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
> -ENOENT
> The page is not present.
>
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?
I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:
while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
2025-12-16 11:49:35.982 UTC [621807] myon(at)postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon(at)postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.
I tried reading the kernel source and it sounds related:
* If the source virtual memory range has any unmapped holes, or if
* the destination virtual memory range is not a whole unmapped hole,
* move_pages() will fail respectively with -ENOENT or -EEXIST. This
* provides a very strict behavior to avoid any chance of memory
* corruption going unnoticed if there are userland race conditions.
* Only one thread should resolve the userland page fault at any given
* time for any given faulting address. This means that if two threads
* try to both call move_pages() on the same destination address at the
* same time, the second thread will get an explicit error from this
* command.
...
* The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
* prevent -ENOENT errors to materialize if there are holes in the
* source virtual range that is being remapped. The holes will be
* accounted as successfully remapped in the retval of the
* command. This is mostly useful to remap hugepage naturally aligned
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
                   unsigned long src_start, unsigned long len, __u64 mode)
...
        if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
                err = -ENOENT;
                break;
What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):
int numa_move_pages(int pid, unsigned long count,
        void **pages, const int *nodes, int *status, int flags)
{
        return move_pages(pid, count, pages, nodes, status, flags);
}
I guess the answer is somewhere in that gap.
> ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)
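To make the second idea concrete, here is a minimal sketch, assuming the
per-page values come from the status[] array of move_pages(2), where
non-negative entries are node ids and negative entries are negated errno
codes such as -ENOENT. The names numa_node_result and
interpret_page_status are hypothetical and are not taken from the
PostgreSQL source; this is only one way the mapping could look.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical result type: a NUMA node id that may be SQL NULL. */
typedef struct
{
	bool		isnull;		/* true if the kernel reported an error */
	int			node;		/* node id, or -1 when isnull is true */
	int			err;		/* positive errno value, or 0 when valid */
} numa_node_result;

/*
 * Interpret one per-page status value.  Values >= 0 are passed through
 * as node ids; negative values become NULL, keeping the errno code so
 * the caller could still log it.
 */
static numa_node_result
interpret_page_status(int status)
{
	numa_node_result r = {false, 0, 0};

	/* Negating INT_MIN would overflow; no errno is that large. */
	assert(status != INT_MIN);

	if (status >= 0)
		r.node = status;
	else
	{
		r.isnull = true;
		r.node = -1;
		r.err = -status;
	}
	return r;
}
```

With this mapping, the -2 from the log above would come out as NULL with
err set to ENOENT instead of raising an ERROR.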
Christoph