From: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: VM corruption on standby |
Date: | 2025-08-19 05:53:06 |
Message-ID: | CALdSSPgDAyqt=ORyLMWMpotb9V4Jk1Am+he39mNtBA8+a8TQDw@mail.gmail.com |
Lists: | pgsql-hackers |
Hi! Thank you for paying attention to this.
On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > But I'm of the opinion that proc_exit
> > is the wrong thing to use after seeing postmaster death, critical
> > section or no. We should assume that system integrity is already
> > compromised, and get out as fast as we can with as few side-effects
> > as possible. It'll be up to the next generation of postmaster to
> > try to clean up.
>
> Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
> someone who holds x calls _exit()?
>
> I don't know if there are other ways that LWLockReleaseAll() can lead
> to persistent corruption that won't be corrected by crash recovery,
> but this one is probably new since the following commit, explaining
> the failure to reproduce on v17:
>
> commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
> Author: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
> Date: Wed Apr 2 12:44:24 2025 +0300
>
> Get rid of WALBufMappingLock
>
> Any idea involving deferring the handling of PM death from here
> doesn't seem right: you'd keep waiting for the CV, but the backend
> that would wake you might have exited.
OK.
> Hmm, I wonder if there could be a solution in between where we don't
> release the locks on PM exit, but we still wake the waiters so they
> can observe a new dead state in the lock word (or perhaps a shared
> postmaster_is_dead flag), and exit themselves.
Since yesterday I have been thinking about adding a new state bit for
LWLockWaitState, something like LW_WS_OWNER_DEAD. It would be set by the
lwlock owner after it observes PM death, and then checked by contenders
in LWLockAcquire.
So, something like:
*lock holder in proc_exit(1)*
```
/* pseudocode */
for each lwlock I hold:
    for each waiter queued on that lock:
        waiter->lwWaiting = LW_WS_OWNER_DEAD;
        PGSemaphoreUnlock(waiter->sem);
```
*lock contender in LWLockAttemptLock*
```
old_state = pg_atomic_read_u32(&lock->state);
/* loop until we've determined whether we could acquire the lock or not */
while (true)
{
    if (old_state & (1 << LW_WS_OWNER_DEAD))
        _exit(2);    /* or maybe proc_exit(1) */
    ....
    if (pg_atomic_compare_exchange_u32(&lock->state, &old_state, desired_state))
        ...
    /* retry */
}
```
I am not sure this idea is workable though.
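To make the contender side a bit more concrete, here is a minimal standalone sketch using C11 atomics instead of the pg_atomic_* API. It is a toy model, not a patch: ToyLock, the TOY_* constants, the bit layout, and the toy_* functions are all hypothetical names invented for illustration. It also assumes the dead-owner flag lives in the lock word itself (as in Thomas's "dead state in the lock word" variant), and it returns a result code where a real backend would call _exit(2).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy lock word; not PostgreSQL's real layout. */
#define TOY_FLAG_OWNER_DEAD ((uint32_t) 1 << 31)
#define TOY_VAL_EXCLUSIVE   ((uint32_t) 1 << 24)
#define TOY_SHARED_MASK     (TOY_VAL_EXCLUSIVE - 1)   /* shared holder count */
#define TOY_LOCK_MASK       (TOY_VAL_EXCLUSIVE | TOY_SHARED_MASK)

typedef struct ToyLock
{
    _Atomic uint32_t state;
} ToyLock;

typedef enum
{
    TOY_ACQUIRED,     /* lock taken */
    TOY_MUST_WAIT,    /* lock busy, caller would queue and sleep */
    TOY_OWNER_DEAD    /* holder saw PM death; caller should exit */
} ToyResult;

static void
toy_lock_init(ToyLock *lock)
{
    atomic_init(&lock->state, 0);
}

/* Holder side: flag the lock word without releasing the lock. */
static void
toy_mark_owner_dead(ToyLock *lock)
{
    atomic_fetch_or(&lock->state, TOY_FLAG_OWNER_DEAD);
}

/* Contender side: CAS loop that checks the dead-owner flag first. */
static ToyResult
toy_attempt_lock(ToyLock *lock, bool exclusive)
{
    uint32_t old_state;

    assert(lock != NULL);
    old_state = atomic_load(&lock->state);

    for (;;)
    {
        uint32_t desired = old_state;
        bool     lock_free;

        if (old_state & TOY_FLAG_OWNER_DEAD)
            return TOY_OWNER_DEAD;

        if (exclusive)
        {
            lock_free = (old_state & TOY_LOCK_MASK) == 0;
            if (lock_free)
                desired += TOY_VAL_EXCLUSIVE;
        }
        else
        {
            lock_free = (old_state & TOY_VAL_EXCLUSIVE) == 0;
            if (lock_free)
                desired += 1;
        }

        /* On failure old_state is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&lock->state, &old_state, desired))
            return lock_free ? TOY_ACQUIRED : TOY_MUST_WAIT;
    }
}
```

The sketch only covers a contender that is about to attempt the lock. A backend that is already asleep on its semaphore would still need the wakeup from the holder side above before it could re-check the flag.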
--
Best regards,
Kirill Reshke