Re: assertion at postmaster start

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: assertion at postmaster start
Date: 2019-08-24 17:30:39
Message-ID: 32750.1566667839@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
>>> Once in a blue moon I get this assertion failure on server start:
>>> TRAP: FailedAssertion("!(AbortStartTime == 0)", Archivo: "/pgsql/source/master/src/backend/postmaster/postmaster.c", Linea: 2957)

>> And "server process" is afaict only used for actual backends, not other
>> types of processes. But we've not yet seen "database system is ready to
>> accept accept connections", so IIRC it could only be a "dead_end" type
>> backend? But we didn't yet see an error from that...
>> Seems to indicate a logic error in postmaster's state machine. Perhaps
>> something related to dead_end processes?

> So if Andres is guessing right, this must be from something trying to
> connect before the postmaster is ready? Seems like that could be
> tested for ...

I got around to trying to test for this, and I find that I can reproduce
the symptom exactly by applying the attached hack and then trying to
connect a couple of seconds after starting the postmaster.

Basically what seems to be happening in Alvaro's report is that

(1) before the startup process is done, something tries to connect,
causing a dead_end child to be launched;

(2) for reasons mysterious, that child exits with exit code 15 rather
than the expected 0 or 1;

(3) the postmaster therefore initiates a system-wide restart cycle;

(4) the startup process completes normally anyway, indicating that the
SIGQUIT arrived too late to affect it;

(5) then we hit the Assert, since we reach the transition-to-normal-run
code even though HandleChildCrash set AbortStartTime in step (3).

The timing window for (4) to happen is extremely tight normally. The
attached patch makes it wider by the expedient of just not sending the
SIGQUIT to the startup process ;-). Then you just need enough of a delay
in startup to perform a manual connection, plus some hack to make the
dead_end child exit with an unexpected exit code.

I think what this demonstrates is that that Assert is just wrong:
we *can* reach PM_RUN with the flag still set, so we should do

StartupStatus = STARTUP_NOT_RUNNING;
FatalError = false;
- Assert(AbortStartTime == 0);
+ AbortStartTime = 0;
ReachedNormalRunning = true;
pmState = PM_RUN;

Probably likewise for the similar Assert in sigusr1_handler.

A larger question is whether we should modify the postmaster logic
so that crashes of dead_end children aren't treated as reasons to
perform a system restart. I'm dubious about this, because frankly,
such crashes shouldn't be happening. There is very little code
that a dead_end child will traverse before exiting ... so how the
devil did it reach an exit(15)? Alvaro, are you running any
nonstandard code in the postmaster (shared_preload_libraries, maybe)?

regards, tom lane

Attachment Content-Type Size
hack-postmaster-for-startup-failure.patch text/x-diff 1.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-08-24 19:39:31 Re: [PATCH] Make configuration file "include" directive handling more robust
Previous Message Pavel Stehule 2019-08-24 16:13:05 Re: Why overhead of SPI is so large?