Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Yuli Khodorkovskiy <yuli(dot)khodorkovskiy(at)crunchydata(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND
Date: 2019-05-20 04:10:17
Message-ID: 1664.1558325417@sss.pgh.pa.us
Lists: pgsql-bugs

Michael Paquier <michael(at)paquier(dot)xyz> writes:
> I have not tested on Windows this one, but on Linux with EXEC_BACKEND
> the test is still not able to detect correctly the failures of the
> syslogger if one reverts 8334515 to re-enable the early syslogger
> startup, so that's a bit disappointing,

[ pokes at that... ] Hah, it proves the syslogger restart logic
works anyway. Because when we restart the crashed syslogger,
we're doing so after shmem exists, so the asserts don't fire.

However, I had a sudden realization about this, which is that
we need to think harder about the question of how the startup
sequence interlocks with the possibility of a pre-existing
postmaster or orphan backends. There's code down inside
CreateDataDirLockFile that attempts to detect a pre-existing
postmaster, but if the postmaster died leaving orphan backends,
that interlock will not detect them. Where we will notice
surviving backends is where we look for a pre-existing shared
memory segment, which is down inside reset_shared.

And: we really should not do anything much to the data directory
until we know that no such old processes remain. Otherwise we
risk problems such as deleting active temp files.
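
To illustrate the kind of check that can notice surviving backends, here is
a minimal sketch, assuming a System V shared memory segment and using only
the standard shmctl() call. The function name attached_process_count is
hypothetical and this is not the actual code in reset_shared; it only shows
that a segment's attach count reveals processes a lock file cannot.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: report how many processes are currently attached
 * to the System V shared memory segment identified by shmid.
 * Returns -1 if the segment cannot be inspected (for example, it no
 * longer exists or we lack permission to read its status).
 */
long
attached_process_count(int shmid)
{
    struct shmid_ds info;

    if (shmctl(shmid, IPC_STAT, &info) < 0)
        return -1;

    /* shm_nattch counts every process that still has the segment mapped */
    return (long) info.shm_nattch;
}
```

A nonzero count for a segment left behind by a dead postmaster would mean
that orphan backends are still attached, so the data directory should be
left alone.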

This line of thought suggests that trying to fix things so that
we can launch child processes before creating shared memory
is the wrong thing, because it seriously risks creating problems
in the leftover-child-processes scenario.

This means that the change that 57431a911 wanted to make is only
going to be safe if we're willing to re-order things so that the
startup sequence is

* create datadir lock file
* create shmem
* launch syslogger
* create sockets
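
The ordering constraint in the list above can be written down as a small
self-contained check. This is only a sketch: the enum values, proposed_order,
step_position, and order_is_safe are hypothetical names invented for
illustration, not postmaster code, and "safe" here means nothing more than
"the syslogger is launched only after both the lock file and shmem steps,
in the order listed".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the four startup steps discussed above. */
typedef enum
{
    STEP_CREATE_DATADIR_LOCK,
    STEP_CREATE_SHMEM,
    STEP_LAUNCH_SYSLOGGER,
    STEP_CREATE_SOCKETS,
    STEP_COUNT
} StartupStep;

/* The order proposed in this message. */
const StartupStep proposed_order[STEP_COUNT] = {
    STEP_CREATE_DATADIR_LOCK,
    STEP_CREATE_SHMEM,
    STEP_LAUNCH_SYSLOGGER,
    STEP_CREATE_SOCKETS
};

/* Return the index of 'step' within 'order', or -1 if it is absent. */
int
step_position(const StartupStep *order, size_t n, StartupStep step)
{
    for (size_t i = 0; i < n; i++)
    {
        if (order[i] == step)
            return (int) i;
    }
    return -1;
}

/*
 * An order passes if the lock file comes before shmem creation and shmem
 * creation comes before the syslogger launch, so that no child process
 * starts until both interlocks against old processes have run.
 */
bool
order_is_safe(const StartupStep *order, size_t n)
{
    int lock = step_position(order, n, STEP_CREATE_DATADIR_LOCK);
    int shmem = step_position(order, n, STEP_CREATE_SHMEM);
    int syslogger = step_position(order, n, STEP_LAUNCH_SYSLOGGER);

    if (lock < 0 || shmem < 0 || syslogger < 0)
        return false;
    return lock < shmem && shmem < syslogger;
}
```

An order that launches the syslogger ahead of shmem creation fails this
check, which is the leftover-child-processes hazard described earlier. The
position of socket creation is deliberately left unconstrained, since
whether it must precede shmem is the open question.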

Historically we've opened the sockets before making shmem. I'm
not sure offhand if there's any compelling reason for that order
... but if there is, getting 57431a911 to work is a whole lot
trickier than we've been thinking.

regards, tom lane
