
Re: VM corruption on standby

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject:	Re: VM corruption on standby
Date:	2025-08-19 15:19:38
Message-ID:	CA+hUKG+9VuVuvABpGiHW3iZ3bkPAs92f3H242SYqDd+JiQo5oQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:
> > > On Linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
> > > waiters would get killed due to the postmaster death signal we've configured
> > > (c.f. PostmasterDeathSignalInit()).
> >
> > No, that has a handler that just sets a global variable. That was
> > done because recovery used to try to read() from the postmaster pipe
> > after replaying every record. Also we currently have some places that
> > don't want to be summarily killed (off the top of my head, syncrep
> > wants to send a special error message, and the logger wants to survive
> > longer than everyone else to catch as much output as possible, things
> > I've been thinking about in the context of threads).
>
> That makes no sense. We should just _exit(). If postmaster has been killed,
> trying to stay up longer just makes everything more fragile. Waiting for the
> logger is *exactly* what we should *not* do - what if the logger also crashed?
> There's no postmaster around to start it.

Nobody is waiting for the logger. The logger waits for everyone else
to exit first to collect forensics:

     * Unlike all other postmaster child processes, we'll ignore postmaster
     * death because we want to collect final log output from all backends and
     * then exit last. We'll do that by running until we see EOF on the
     * syslog pipe, which implies that all other backends have exited
     * (including the postmaster).
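
For anyone not familiar with the trick, here is a minimal standalone sketch of "run until EOF on the pipe". It is not the syslogger code; drain_until_eof() and collect_final_output() are hypothetical names made up for illustration. It relies only on the fact that read() on a pipe returns 0 once every process holding the write end has closed it or exited.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until EOF, returning total bytes seen. */
static long
drain_until_eof(int fd)
{
    char        buf[256];
    long        total = 0;

    assert(fd >= 0);
    for (;;)
    {
        ssize_t     n = read(fd, buf, sizeof(buf));

        if (n > 0)
            total += n;         /* a real logger would write this out */
        else if (n == 0)
            return total;       /* EOF: every write end is closed */
        else if (errno != EINTR)
            return -1;
    }
}

/* Hypothetical demo: nwriters children share one pipe; collect their output. */
static long
collect_final_output(int nwriters)
{
    static const char msg[] = "last words\n";
    int         fds[2];
    long        total;

    if (pipe(fds) != 0)
        return -1;

    for (int i = 0; i < nwriters; i++)
    {
        pid_t       pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0)
        {
            close(fds[0]);
            if (write(fds[1], msg, sizeof(msg) - 1) < 0)
                _exit(1);
            _exit(0);           /* exiting closes the child's write end */
        }
    }

    close(fds[1]);              /* the collector must not hold a write end */
    total = drain_until_eof(fds[0]);
    close(fds[0]);

    while (wait(NULL) > 0)
        ;                       /* reap the writers */
    return total;
}
```

The collector never has to track who is still alive: EOF only arrives after the last writer is gone, so whatever was written before that point has already been read.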

The syncrep case is a bit weirder: it wants to tell the user that
syncrep is broken, so its own WaitEventSetWait() has
WL_POSTMASTER_DEATH, but that's basically bogus because the backend
can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code
paths. I've proposed nuking that before.
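
To make the difference between the two flags concrete, here is a rough standalone model, not PostgreSQL code. It assumes postmaster death shows up as the write end of a pipe going away, with nothing ever written to it. The SKETCH_* names and both functions are hypothetical stand-ins for WL_POSTMASTER_DEATH and WL_EXIT_ON_PM_DEATH.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for WL_POSTMASTER_DEATH and WL_EXIT_ON_PM_DEATH. */
enum pm_death_policy
{
    SKETCH_REPORT_PM_DEATH,
    SKETCH_EXIT_ON_PM_DEATH
};

/*
 * Block until the write end of the "postmaster alive" pipe is gone.
 * Returns 1 under the report policy (-1 on error); never returns under
 * the exit policy.
 */
static int
sketch_wait_pm_death(int alive_fd, enum pm_death_policy policy)
{
    struct pollfd pfd;

    assert(alive_fd >= 0);
    pfd.fd = alive_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    while (poll(&pfd, 1, -1) < 0)
    {
        if (errno != EINTR)
            return -1;
    }

    if (policy == SKETCH_EXIT_ON_PM_DEATH)
        _exit(1);               /* no chance to say anything */
    return 1;                   /* caller decides what to do next */
}

/* Hypothetical demo: run one waiter in a child, then "kill the postmaster". */
static int
sketch_waiter_exit_status(enum pm_death_policy policy)
{
    int         fds[2];
    int         status;
    pid_t       pid;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        close(fds[1]);
        if (sketch_wait_pm_death(fds[0], policy) == 1)
            _exit(0);           /* got the event back: room for a message */
        _exit(2);
    }

    close(fds[0]);
    close(fds[1]);              /* simulate postmaster death */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Under the report policy the waiter gets control back and could emit its own error; under the exit policy it is gone as soon as the wait notices. A process that can hit either kind of wait, depending on the code path, cannot count on delivering its special message.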

In response to

Re: VM corruption on standby at 2025-08-19 14:57:43 from Andres Freund

Responses

Re: VM corruption on standby at 2025-08-19 15:24:27 from Andres Freund
