Re: VM corruption on standby

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: VM corruption on standby
Date: 2025-08-19 13:29:34
Message-ID: fe039a5c-7c15-415c-a082-eaec856b4433@postgrespro.ru
Lists: pgsql-hackers

19.08.2025 16:17, Kirill Reshke wrote:
> On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>>
>> This thread is a candidate for [0]
>>
>>
>> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
>>
>
> Let me summarize this thread for ease of understanding of what's going on:
>
> Timeline:
> 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is
> corruption in VM bits.
> 2) We investigated and found that the problem is not in how PostgreSQL
> modifies buffers or logs changes, but in LWLockReleaseAll in
> proc_exit(1) after kill -9 of the postmaster.
> 3) We have reached the conclusion that there is no corruption, and
> that injection points are not a valid way to reproduce it, because
> of WaitLatch and friends.
>
> 4) But we now suspect there is another corruption with ANY critical
> section in the following scenario:
>
> I wrote:
>
>> Maybe I'm very wrong about this, but I'm currently suspecting there is
>> corruption involving CHECKPOINT, process in CRIT section and kill -9.
>> 1) Some process p1 locks some buffer (name it buf1), enters CRIT
>> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar
>> in (GetXLogBuffer -> AdvanceXLInsertBuffer).
>> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1
>> 3) Postmaster is killed with kill -9
>> 4) signal of postmaster death delivered to p1, it wakes up in
>> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
>> aliveness, and exits releasing all locks.
>> 5) p2 acquires locks on buf1 and flushes it to disk.
>> 6) signal of postmaster death delivered to p2, p2 exits.
>
> 5) We create an open item for pg18 and propose reverting
> bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly.

Latch and ConditionVariable (which uses Latch) are among the basic
synchronization primitives in PostgreSQL.
Therefore they have to work correctly anywhere: in a critical section, in
WAL logging, etc.
The current behavior of WaitEventSetWaitBlock is certainly a bug and
ought to be fixed.
So +1 for _exit(2) as Tom suggested.
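
To illustrate why _exit(2) is safer here than proc_exit(1), below is a
minimal standalone sketch. It is not PostgreSQL code. The names
release_all_locks and run_child are hypothetical, and an atexit() handler
merely stands in for the exit-time cleanup (such as LWLockReleaseAll) that
runs on the proc_exit(1) path. The child plays the role of a backend that
notices postmaster death in the middle of a critical section.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe the child uses to report that cleanup ran. */
static int cleanup_fd = -1;

/*
 * Hypothetical stand-in for exit-time cleanup such as LWLockReleaseAll().
 * It reports to the parent that "locks were released".
 */
static void
release_all_locks(void)
{
    char    c = 'R';
    ssize_t rc = write(cleanup_fd, &c, 1);

    (void) rc;
}

/*
 * Fork a child that "detects postmaster death" and terminates either via
 * exit(1) (exit handlers run, as on the proc_exit path) or via _exit(2)
 * (no handlers run).  Returns the child's exit code, or -1 on error.
 * *cleanup_ran is set to 1 if the cleanup handler ran, else 0.
 */
int
run_child(int use_immediate_exit, int *cleanup_ran)
{
    int     pipefd[2];
    pid_t   pid;
    int     status;
    char    c;

    assert(cleanup_ran != NULL);

    if (pipe(pipefd) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
    {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    if (pid == 0)
    {
        /* Child: pretend we hold a buffer lock inside a critical section. */
        close(pipefd[0]);
        cleanup_fd = pipefd[1];
        atexit(release_all_locks);

        if (use_immediate_exit)
            _exit(2);           /* skips atexit handlers: locks stay held */
        exit(1);                /* runs atexit handlers: locks get released */
    }

    /* Parent: one byte arrives only if the cleanup handler ran. */
    close(pipefd[1]);
    *cleanup_ran = (read(pipefd[0], &c, 1) == 1);
    close(pipefd[0]);

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling run_child(0, &ran) sets ran to 1, while run_child(1, &ran) sets it
to 0. In the second case nothing is "released", so in the scenario quoted
above the checkpointer would never get the lock on buf1 and could not
flush a page whose WAL record was never inserted.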

--
regards
Yura Sokolov aka funny-falcon
