From: | Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> |
---|---|
To: | Kirill Reshke <reshkekirill(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: VM corruption on standby |
Date: | 2025-08-19 13:29:34 |
Message-ID: | fe039a5c-7c15-415c-a082-eaec856b4433@postgrespro.ru |
Lists: | pgsql-hackers |
On 19.08.2025 16:17, Kirill Reshke wrote:
> On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>>
>> This thread is a candidate for [0]
>>
>>
>> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
>>
>
> Let me summarize this thread for ease of understanding of what's going on:
>
> Timeline:
> 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is
> corruption in VM bits.
> 2) We investigated and found that the problem is not in how PostgreSQL
> modifies buffers or logs changes, but in LWLockReleaseAll in
> proc_exit(1) after kill -9 of the postmaster.
> 3) We have reached the conclusion that there is no corruption, and
> that injection points are not a valid way to reproduce it, because
> of WaitLatch and friends.
>
> 4) But we now suspect there is another corruption, possible with ANY
> critical section, in the following scenario:
>
> I wrote:
>
>> Maybe I'm very wrong about this, but I'm currently suspecting there is
>> corruption involving CHECKPOINT, process in CRIT section and kill -9.
>> 1) Some process p1 locks some buffer (name it buf1), enters CRIT
>> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar
>> in (GetXLogBuffer -> AdvanceXLInsertBuffer).
>> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
>> 3) The postmaster is killed with kill -9.
>> 4) The postmaster-death signal is delivered to p1; it wakes up in
>> WaitLatch/WaitEventSetWaitBlock, checks postmaster
>> aliveness, and exits, releasing all locks.
>> 5) p2 acquires the lock on buf1 and flushes it to disk.
>> 6) The postmaster-death signal is delivered to p2; p2 exits.
>
> 5) We create an open item for pg18 and propose reverting
> bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly.
Latch and ConditionVariable (which is built on Latch) are among the basic
synchronization primitives in PostgreSQL.
Therefore they have to work correctly everywhere: in critical sections, during
WAL logging, etc.
The current behavior of WaitEventSetWaitBlock is certainly a bug, and it
ought to be fixed.
So +1 for _exit(2) as Tom suggested.
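
To illustrate why the exit path matters, here is a toy sketch. It is not
PostgreSQL code. It only shows the general POSIX difference between exit()
and _exit(): a handler registered with atexit() runs on exit() but not on
_exit(). The handler stands in, by assumption, for the lock release reached
through proc_exit(1). The names release_all_locks and cleanup_runs_on_exit
are hypothetical, and the pipe is just a way for the parent to observe
whether the "locks" were released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; a byte written here means "locks were released". */
static int release_fd = -1;

/* Hypothetical stand-in for the lock cleanup done on a normal exit. */
static void
release_all_locks(void)
{
    char c = 'R';

    if (write(release_fd, &c, 1) != 1)
    {
        /* nothing useful to do in this toy model */
    }
}

/*
 * Fork a child that pretends to hold a lock inside a critical section and
 * then decides to terminate. Returns true if the cleanup handler ran, i.e.
 * if another process could have acquired the lock afterwards.
 */
static bool
cleanup_runs_on_exit(bool use_underscore_exit)
{
    int     fds[2];
    pid_t   pid;
    char    c;
    ssize_t n;

    if (pipe(fds) != 0)
        abort();

    fflush(NULL);               /* avoid duplicated stdio output in the child */
    pid = fork();
    assert(pid >= 0);

    if (pid == 0)
    {
        close(fds[0]);
        release_fd = fds[1];
        atexit(release_all_locks);

        if (use_underscore_exit)
            _exit(2);           /* terminates at once, no atexit handlers */
        exit(1);                /* runs atexit handlers first */
    }

    close(fds[1]);
    n = read(fds[0], &c, 1);    /* 0 means EOF: child died without cleanup */
    close(fds[0]);
    waitpid(pid, NULL, 0);

    return n == 1;
}
```

In this model cleanup_runs_on_exit(false) reports that the cleanup ran, which
corresponds to step 4 of the scenario above (p1 releases its locks and p2 can
flush buf1). With cleanup_runs_on_exit(true) the child dies without releasing
anything, so nothing else gets a chance to act on the half-finished state.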
--
regards
Yura Sokolov aka funny-falcon