|From:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Subject:||Non-robustness in pmsignal.c|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
As I mentioned in another thread, I came across a reproducible
situation in which a memory clobber in a child backend crashes
the postmaster too, at least on FreeBSD/arm64. Needless to say,
this is Not Cool. I've now traced down what is happening,
and it's this:
1. Careless coding in aset.c causes it to decide to wipe_mem
the universe. (I'll have more to say about that separately;
the point of this thread is keeping the postmaster alive
afterwards.) Apparently, there's not any non-live memory
space between process-local memory and shared memory on this
platform, so the failing backend manages to trash shared memory
too before it finally hits SIGSEGV.
2. Most of the background processes die on something like
TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File: "latch.c", Line: 686, PID: 5916)
or they encounter what seems to be a stuck spinlock. The postmaster,
however, SIGSEGVs. It's not supposed to do that; it is supposed to
be sufficiently arms-length from shared memory that it can recover
despite a backend trashing shared memory contents.
3. The cause of the SIGSEGV is that AssignPostmasterChildSlot
naively believes that it can trust PMSignalState->next_child_flag
to be a valid array index, so after that's been clobbered with
something like 0x7f7f7f7f we index off the end of memory.
I see no good reason for that state variable to be in shared memory
at all, so the attached patch just moves it to postmaster static
data. We also need a less-exposed copy of the array size variable.
4. That's enough to stop the SIGSEGV crash, but the postmaster
still fails to recover, because then it hits
elog(FATAL, "no free slots in PMChildFlags array");
since all of the array entries have been clobbered as well.
In the attached patch, I fixed this by treating the case similarly
to failure to fork a new child process. This seems to be enough
to let the postmaster survive, and recover after it starts noticing
5. It's possible that we should take some proactive steps to get out
of the "no free slots" situation, rather than just wait for some
child to crash. I'm inclined not to, however. It'd be hard-to-test
corner-case code, and given the lack of field reports like this,
the situation must be awfully rare.
regards, tom lane
|Next Message||Andres Freund||2022-10-08 00:28:29||Re: Non-robustness in pmsignal.c|
|Previous Message||Tomas Vondra||2022-10-07 23:43:56||Re: postgres_fdw: using TABLESAMPLE to collect remote sample|