Re: Adding basic NUMA awareness

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-25 10:27:11
Message-ID: CAKZiRmxemXXVAouzM4Ls7pA7e0u6CVSLJeL_phKkmGPOzvUv_g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 7/4/25 20:12, Tomas Vondra wrote:
> > On 7/4/25 13:05, Jakub Wartak wrote:
> >> ...
> >>
> >> 8. v1-0005 2x + /* if (numa_procs_interleave) */
> >>
> >> Ha! It's a TRAP! I've uncommented it because I wanted to try it out
> >> without it (just by setting the GUC off), but "MyProc->sema" is NULL:
> >>
> >> 2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
> >> [..]
> >> 2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) was terminated by signal 11: Segmentation fault
> >> 2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other active server processes
> >> 2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because "restart_after_crash" is off
> >> 2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down
> >>
> >> [New LWP 28755]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >> Core was generated by `postgres: io worker '.
> >> Program terminated with signal SIGSEGV, Segmentation fault.
> >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
> >> 136 ./nptl/sem_waitcommon.c: No such file or directory.
> >> (gdb) where
> >> #0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
> >> #1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
> >> #2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at ../src/backend/port/posix_sema.c:302
> >> #3 0x0000556191970553 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:992
> >> #4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at ../src/backend/postmaster/auxprocess.c:65
> >> #5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/storage/aio/method_worker.c:393
> >> #6 0x00005561918e8163 in postmaster_child_launch (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, startup_data=startup_data@entry=0x0, startup_data_len=startup_data_len@entry=0, client_sock=client_sock@entry=0x0) at ../src/backend/postmaster/launch_backend.c:290
> >> #7 0x00005561918ea09a in StartChildProcess (type=type@entry=B_IO_WORKER) at ../src/backend/postmaster/postmaster.c:3973
> >> #8 0x00005561918ea308 in maybe_adjust_io_workers () at ../src/backend/postmaster/postmaster.c:4404
> >> [..]
> >> (gdb) print *MyProc->sem
> >> Cannot access memory at address 0x0
> >>
> >
> > Yeah, good catch. I'll look into that next week.
> >
>
> I've been unable to reproduce this issue, but I'm not sure what settings
> you actually used for this instance. Can you give me more details how to
> reproduce this?

Better late than never. Feel free to partially ignore me: I missed that
this is a known issue, as per the FIXME there. Still, I would just rip
out that commented-out `if (numa_procs_interleave)` from
FastPathLockShmemSize() and PGProcShmemSize(), unless of course you want
to save those memory pages (in the non-NUMA case). If you do want to
save those pages, I think we have a problem:

For the complete picture, here are the steps:

1. patch -p1 < v2-0001-NUMA-interleaving-buffers.patch
2. patch -p1 < v2-0006-NUMA-interleave-PGPROC-entries.patch

BTW, the accidental pgbench indentation change is still there (part of
the v2-0001 patch):
14 out of 14 hunks FAILED -- saving rejects to file src/bin/pgbench/pgbench.c.rej

3. As I'm applying only 0001 and 0006, I got two simple rejects
(because the numa_ freelist patches in between were not applied), but
I fixed them. That's intentional on my part, because I wanted to play
with just those two.

4. Then I uncomment those two "if (numa_procs_interleave)" conditions
guarding the optional shared memory sizing (the add_size() calls and so
on, which have an XXX comment above them saying they cause bootstrap
issues).

5. initdb with numa_procs_interleave=on and huge_pages=on (!), then start: it is OK.

6. Restart with numa_procs_interleave=off, which makes every background
process crash, e.g.:

(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x0000563e2d6e4d5c in PGSemaphoreReset (sema=0x0) at ../src/backend/port/posix_sema.c:302
#3 0x0000563e2d774d93 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:995
#4 0x0000563e2d6e9252 in AuxiliaryProcessMainCommon () at ../src/backend/postmaster/auxprocess.c:65
#5 0x0000563e2d6eb683 in CheckpointerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/postmaster/checkpointer.c:190
#6 0x0000563e2d6ec363 in postmaster_child_launch (child_type=child_type@entry=B_CHECKPOINTER, child_slot=249, startup_data=startup_data@entry=0x0, startup_data_len=startup_data_len@entry=0, client_sock=client_sock@entry=0x0) at ../src/backend/postmaster/launch_backend.c:290
#7 0x0000563e2d6ee29a in StartChildProcess (type=type@entry=B_CHECKPOINTER) at ../src/backend/postmaster/postmaster.c:3973
#8 0x0000563e2d6f17a6 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x563e377cc0e0) at ../src/backend/postmaster/postmaster.c:1386
#9 0x0000563e2d4948fc in main (argc=3, argv=0x563e377cc0e0) at ../src/backend/main/main.c:231

Notice sema=0x0, because:
#3 0x000056050928cd93 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:995
995 PGSemaphoreReset(MyProc->sem);
(gdb) print MyProc
$1 = (PGPROC *) 0x7f09a0c013b0
(gdb) print MyProc->sem
$2 = (PGSemaphore) 0x0

Or, with printfs:

2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal PGPROC=0x7f9de827b880 requestSize=148770
// after proc && ptr manipulation:
2025-07-25 11:17:23.683 CEST [21772] LOG: in InitProcGlobal PGPROC=0x7f9de827bdf0 requestSize=148770 procs=0x7f9de827b880 ptr=0x7f9de827bdf0
[..initialization of aux PGPROCs i=0.., still from InitProcGlobal(); each gets a properly allocated sem, as one would expect:]
[..for i loop:]
2025-07-25 11:17:23.689 CEST [21772] LOG: i=136 , proc=0x7f9de8600000, proc->sem=0x7f9da4e04438
2025-07-25 11:17:23.689 CEST [21772] LOG: i=137 , proc=0x7f9de8600348, proc->sem=0x7f9da4e044b8
2025-07-25 11:17:23.689 CEST [21772] LOG: i=138 , proc=0x7f9de8600690, proc->sem=0x7f9da4e04538
[..but then, in the child code paths, out of the blue in InitAuxiliaryProcess() the whole MyProc looks as if it had been memset to zeros:]
2025-07-25 11:17:23.693 CEST [21784] LOG: auxiliary process using MyProc=0x7f9de8600000 auxproc=0x7f9de8600000 proctype=0 MyProcPid=21784 MyProc->sem=(nil)

The above got PGPROC slot i=136 at address 0x7f9de8600000; later that
auxiliary process is launched, but something has NULLified ->sem there
(according to gdb, everything there is zero).

7. The original v2-0006 patch (with the two "if (numa_procs_interleave)"
still commented out) behaves OK. In my case, with a single NUMA node,
that gives add_size(.., (1+1) * 2MB) = 4MB:

2025-07-25 11:38:54.131 CEST [23939] LOG: in InitProcGlobal PGPROC=0x7f25cbe7b880 requestSize=4343074
2025-07-25 11:38:54.132 CEST [23939] LOG: in InitProcGlobal PGPROC=0x7f25cbe7bdf0 requestSize=4343074 procs=0x7f25cbe7b880 ptr=0x7f25cbe7bdf0

So something is apparently zeroing out all those MyProc structures on
startup (probably due to a wrong alignment somewhere?). I was thinking
about trapping writes to this single i=136 PGPROC at 0x7f9de8600000 via
mprotect() to see what is resetting it, but oh well, mprotect() works
only on whole pages...

-J.
