Re: VM corruption on standby

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: VM corruption on standby
Date: 2025-08-19 23:59:42
Message-ID: CA+hUKGLqaXJJpsxBBNAe4Xk1Sn8yKRxOAQtnVgNQOoLvtdobxA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm inclined to think that we do want to prohibit WaitEventSetWait
> inside a critical section --- it just seems like a bad idea all
> around, even without considering this specific failure mode.

FWIW, aio/README.md describes a case where, in order to start writing
WAL (which happens inside a critical section), we'd need to wait for
an IO, and that wait might involve a CV while we wait for an IO worker
to do something.  Starting IO involves calling pgaio_io_acquire(), and
if it can't find a free handle it calls pgaio_io_wait_for_free().
That's all hypothetical for now, as v18 is only doing reads, but it's
an important architectural principle.  That makes me suspect this new
edict can't be the final policy, even if v18 uses it to solve the
immediate problem.
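
For concreteness, the edict in question would presumably amount to a
guard like the hypothetical sketch below, at the top of
WaitEventSetWait() and friends; an illustration only, not a patch from
this thread:

/*
 * Hypothetical sketch, not a patch: refuse to start a latch/WES wait
 * while a critical section is open.  CritSectionCount is the existing
 * counter maintained by START_CRIT_SECTION()/END_CRIT_SECTION().
 */
#include "postgres.h"
#include "miscadmin.h"          /* CritSectionCount */

static inline void
AssertNotWaitingInCriticalSection(void)
{
    /*
     * A blocking wait here can wedge forever if the process we're
     * waiting for (or the postmaster) has already exited.
     */
    Assert(CritSectionCount == 0);
}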

For v19 I think we should probably attack the original sin and make
this work.  Several mechanisms for unwedging LWLockAcquire() have been
mentioned:

(1) the existing LWLockReleaseAll(), which clearly makes bogus
assumptions about system state and cannot stay,

(2) some new thing that would sem_post() all the waiters after setting
flags that cause LWLockAcquire() to exit (ie a sort of multiplexing,
making our semaphore-based locks inch towards latch-nature; see the
toy sketch after this list),

(3) moving LWLock waits over to latches, so the wait would already be
multiplexed with PM death detection,

(4) having the parent death signal handler exit directly
(unfortunately Linux and FreeBSD only*),

(5) in multi-threaded prototype work, the whole process exits anyway,
taking all backend threads with it**, which is a strong motivation to
make multi-process mode act as much like that as possible, eg
something that exits a lot more eagerly, and hopefully preemptively,
than today.
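
The toy below is only meant to show the shape of idea (2), with
invented names, a process-local flag standing in for what would really
live in shared memory, and sem_init() omitted; it is not PostgreSQL
code.

#include <semaphore.h>
#include <signal.h>
#include <stdbool.h>

static sem_t wait_sem;          /* per-backend wait semaphore (init omitted) */
static volatile sig_atomic_t postmaster_possibly_dead = false;

/* Whoever notices postmaster death sets the flag, then wakes each waiter. */
static void
unwedge_waiter(void)
{
    postmaster_possibly_dead = true;
    sem_post(&wait_sem);
}

/* Simplified LWLockAcquire()-style wait loop: returns false to tell the
 * caller to bail out instead of limping along. */
static bool
wait_for_lock(void)
{
    for (;;)
    {
        if (sem_wait(&wait_sem) != 0)
            continue;           /* EINTR: retry */
        if (postmaster_possibly_dead)
            return false;       /* woken only to be told to give up */
        return true;            /* normal wakeup: lock granted */
    }
}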

* That's an IRIX invention picked up by Linux and FreeBSD, a sort of
reverse SIGCHLD, and I've tried to recreate it for pure POSIX systems
before.  Two ideas: (1) maybe it's enough for any backend that learns
of postmaster death to signal everyone else, since they can't all be
blocked in sem_wait() unless there is already a deadlock; (2) I once
tried making the postmaster death pipe O_ASYNC so that the "owner"
gets a signal when it becomes readable, but it turned out to require a
separate postmaster pipe for every backend (not just a dup'd
descriptor); perhaps this would be plausible if we already had a
bidirectional postmaster control socket protocol and chose to give
every backend process its own socket pair in MP mode, something I've
been looking into for various other reasons.
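
For anyone who hasn't used it, the Linux spelling is just a prctl() in
the child (FreeBSD spells it procctl() with PROC_PDEATHSIG_CTL);
here's a minimal standalone toy, not PostgreSQL code:

#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int
main(void)
{
    pid_t parent = getpid();
    pid_t child = fork();

    if (child == 0)
    {
        /* Ask the kernel to send us SIGTERM when our parent exits. */
        if (prctl(PR_SET_PDEATHSIG, SIGTERM) != 0)
        {
            perror("prctl");
            _exit(1);
        }
        /* Close the race: the parent may already be gone. */
        if (getppid() != parent)
            raise(SIGTERM);
        pause();                /* parent death will terminate us here */
        _exit(0);
    }

    sleep(1);
    printf("parent exiting; child %d should now receive SIGTERM\n",
           (int) child);
    return 0;
}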

** I've been studying the unusual logger case in this context and
contemplated running it as a separate process even in MT mode, as its
stated aim didn't sound crazy to me and I was humbly attempting to
preserve that characteristic in MT mode.  Another way to achieve MP/MT
consistency is to decide that the MP design already isn't robust
enough when the pipe is full, and just nuke the logger like everything
else.  Reading Andres's earlier comments, I briefly wondered about a
compromise where log senders would make a best effort to send
non-blockingly once they know the postmaster is gone.  But that's
neither as reliable as whoever wrote that had in mind (and in their
defence, the logger is basically independent of shared memory state,
so whether it should be exiting ASAP or draining final log statements
is at least debatable; barring bugs, it's only going to block progress
if your system is really hosed), nor entirely free of the "limping
along" syndrome that Andres argues against quite compellingly, so I
cancelled that thought.
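
(For what it's worth, the compromise I wondered about was only on the
order of the following sketch, once the sender somehow knows the
postmaster is gone; invented helper, not a proposal.)

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Best-effort, non-blocking send to the logger pipe: drop the message
 * rather than block forever on a full pipe nobody may ever drain.
 */
static void
send_log_best_effort(int log_fd, const char *buf, size_t len)
{
    int         flags = fcntl(log_fd, F_GETFL);

    if (flags >= 0)
        (void) fcntl(log_fd, F_SETFL, flags | O_NONBLOCK);

    if (write(log_fd, buf, len) < 0 &&
        (errno == EAGAIN || errno == EWOULDBLOCK))
    {
        /* pipe full and possibly nobody left to drain it: give up */
    }
}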
