Re: failed NUMA pages inquiry status: Operation not permitted

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Christoph Berg <myon(at)debian(dot)org>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: failed NUMA pages inquiry status: Operation not permitted
Date: 2026-01-05 21:35:45
Message-ID: b93d876b-67c1-4f0e-b0c5-a4296f09f5b5@vondra.me
Lists: pgsql-committers pgsql-hackers

On 12/17/25 12:07, Tomas Vondra wrote:
>
>
> On 12/16/25 18:54, Christoph Berg wrote:
>> Re: Tomas Vondra
>>> 1) right after opening a connection, I get this
>>>
>>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>> numa_node | count
>>> -----------+-------
>>> 0 | 290
>>> -2 | 32478
>>
>> Does that mean that the "touch all pages" logic is missing in some
>> code paths?
>>
>
> I did check and AFAICS we are touching the pages in pg_buffercache_numa.
>
> To make it even more confusing, I can no longer reproduce the behavior I
> reported yesterday. It just consistently reports "0" and I have no idea
> why it changed :-( I did restart since yesterday, so maybe that changed
> something.
>

I kept poking at this, and I managed to reproduce it again. The key
seems to be that the system needs to be under pressure, and then it's
reliably reproducible (at least for me).

I created two instances - one to keep the system busy, one for
experimentation. The "busy" one is set to use shared_buffers=16GB, and
runs a read-only pgbench:

    pgbench -i -s 4500 test
    pgbench -S -j 16 -c 64 -T 600 -P 1 test

The system has 64GB of RAM and 12 cores, so this is a lot of load.

The other instance is set to use shared_buffers=4GB. I start it and
immediately query the NUMA info for buffers (in a loop):

    pg_ctl -D data -l pg.log start;

    for r in $(seq 1 10); do
        psql -p 5001 test -c 'select numa_node, count(*) from
            pg_buffercache_numa group by 1';
    done;

    pg_ctl -D data -l pg.log stop;
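
To put a number on "often", the loop output can be tallied. This is only
a minimal sketch, assuming the psql output format shown in this mail;
count_unknown_runs is a hypothetical helper name, not something that
exists in the tree.

```shell
# Hypothetical helper: reads concatenated psql output on stdin and prints
# how many rows report numa_node = -2. Because the query groups by
# numa_node, each result set has at most one such row, so this is also
# the number of runs that saw "-2" pages.
count_unknown_runs() {
    # Match rows whose first column is -2, allowing leading whitespace.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -c -E '^[[:space:]]*-2[[:space:]]*\|' || true
}

# Usage (needs the running instance from this mail, so left commented out):
#   for r in $(seq 1 10); do
#       psql -p 5001 test -c 'select numa_node, count(*) from
#           pg_buffercache_numa group by 1'
#   done | count_unknown_runs
```

It only counts runs and ignores how many pages were affected in each.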

And this often fails like this:

----------------------------------------------------------------------

waiting for server to start.... done
server started
 numa_node |  count
-----------+---------
         0 | 1045302
        -2 |    3274
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1025321
        -2 |   23255
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1038596
        -2 |    9980
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048518
        -2 |      58
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048525
        -2 |      51
(2 rows)

waiting for server to shut down.... done
server stopped

----------------------------------------------------------------------

So, it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.
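
As a sanity check on the counts above: every run adds up to 1048576
rows, which is what one would expect if the view returns one row per OS
page rather than one per buffer. The arithmetic below is a sketch under
assumptions not stated in this mail, namely the default 8kB block size
and 4kB OS pages (no huge pages).

```shell
# shared_buffers=4GB on the experimental instance
shared_buffers_bytes=$((4 * 1024 * 1024 * 1024))
block_size=8192      # assumed: default BLCKSZ
os_page_size=4096    # assumed: check with "getconf PAGESIZE"

buffers=$((shared_buffers_bytes / block_size))
rows=$((shared_buffers_bytes / os_page_size))

echo "buffers=$buffers rows=$rows"
```

With these assumptions each buffer spans two OS pages, so the "-2"
counts are pages, not buffers.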

Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap - once I removed it, this behavior
disappeared too. So it seems a page can be moved to swap, in which case
we get -2 (that is, -ENOENT, "page is not present") for the status.
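
One way to check the swap theory from outside PostgreSQL might be to sum
the "Swap:" fields in /proc/<pid>/smaps for the postmaster or a backend.
This is a rough sketch, assuming a Linux kernel that accounts
swapped-out shared memory pages in that field; swap_kb is a made-up
helper name, and it takes the smaps path as an argument.

```shell
# Hypothetical helper: print the total of all "Swap:" entries (in kB)
# found in the given smaps file. Lines such as "SwapPss:" are ignored
# because the first field must match "Swap:" exactly.
swap_kb() {
    awk '$1 == "Swap:" { total += $2 } END { print total + 0 }' "$1"
}

# Usage (assumes the "data" directory from this mail; the first line of
# postmaster.pid is the postmaster PID):
#   swap_kb "/proc/$(head -n 1 data/postmaster.pid)/smaps"
```

A non-zero value during the loop would be consistent with pages having
been swapped out, although it does not say which pages.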

In hindsight, that's not all that surprising. It's interesting it can
happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.

FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too, which I find a bit weird.
Why should those pages be paged out?

The question is what to do about this. I don't think we can prevent the
-2 values, and erroring out does not seem great either (most systems
have swap, so -2 may not be all that rare).

In fact, pg_shmem_allocations_numa probably should not error out either,
because it's now reliably failing (on the busy instance).

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...

regards

--
Tomas Vondra
