Re: VM corruption on standby

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: VM corruption on standby
Date: 2025-08-19 05:31:36
Message-ID: CA+hUKGJfOGBf55oLsgvv1PZSuJm1+R8yFbVHsP3VnEu=dOqayQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> But I'm of the opinion that proc_exit
> is the wrong thing to use after seeing postmaster death, critical
> section or no. We should assume that system integrity is already
> compromised, and get out as fast as we can with as few side-effects
> as possible. It'll be up to the next generation of postmaster to
> try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

I don't know if there are other ways that LWLockReleaseAll() can lead
to persistent corruption that won't be corrected by crash recovery,
but this one is probably new since the following commit, explaining
the failure to reproduce on v17:

commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
Author: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
Date: Wed Apr 2 12:44:24 2025 +0300

    Get rid of WALBufMappingLock

Any idea involving deferring the handling of PM death from here
doesn't seem right: you'd keep waiting for the CV, but the backend
that would wake you might have exited.

Hmm, I wonder if there could be a solution in between where we don't
release the locks on PM exit, but we still wake the waiters so they
can observe a new dead state in the lock word (or perhaps a shared
postmaster_is_dead flag), and exit themselves.
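
To make the in-between idea concrete, here is a toy sketch in C11. It is not PostgreSQL code: ToyLock, the TOY_LOCK_* bits and the toy_lock_* functions are all hypothetical names invented for illustration, and the real LWLock state word and wait queue look different. It only shows a holder setting a "dead" bit without releasing the lock, and a would-be acquirer checking that bit before deciding whether to sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bits in a toy lock word; not PostgreSQL's LWLock layout. */
#define TOY_LOCK_HELD ((uint32_t) 1u << 0)
#define TOY_LOCK_DEAD ((uint32_t) 1u << 1)

typedef struct ToyLock
{
    _Atomic uint32_t state;
} ToyLock;

typedef enum ToyResult
{
    TOY_ACQUIRED,     /* caller now holds the lock */
    TOY_WOULD_BLOCK,  /* lock is held; caller would normally sleep */
    TOY_OWNER_DEAD    /* dead bit seen; caller should exit, not sleep */
} ToyResult;

static void
toy_lock_init(ToyLock *lock)
{
    assert(lock);
    atomic_init(&lock->state, 0);
}

/*
 * One acquisition attempt.  The dead bit is checked before the held bit,
 * so a waiter that is woken up and retries will see it and bail out
 * instead of going back to sleep on a lock nobody will ever release.
 */
static ToyResult
toy_lock_try_acquire(ToyLock *lock)
{
    uint32_t old = atomic_load(&lock->state);

    for (;;)
    {
        if (old & TOY_LOCK_DEAD)
            return TOY_OWNER_DEAD;
        if (old & TOY_LOCK_HELD)
            return TOY_WOULD_BLOCK;
        /* On failure the CAS reloads 'old', and we re-check both bits. */
        if (atomic_compare_exchange_weak(&lock->state, &old,
                                         old | TOY_LOCK_HELD))
            return TOY_ACQUIRED;
    }
}

/*
 * Called by a holder that has noticed postmaster death.  The held bit is
 * deliberately left set, so nobody can enter the protected state.
 */
static void
toy_lock_mark_dead(ToyLock *lock)
{
    atomic_fetch_or(&lock->state, TOY_LOCK_DEAD);
    /* A real implementation would also wake every queued waiter here. */
}
```

In this sketch the lock is never handed over after the dead bit is set, so whatever half-updated state it protects stays unreachable. The part it leaves out is the wakeup itself, which is what lets sleeping waiters retry and see the bit.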

Nice detective work, Andrey and others! That's a complicated and rare
interaction.
