Re: VM corruption on standby

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: VM corruption on standby
Date: 2025-08-19 13:17:44
Message-ID: CALdSSPhGQ1xx10c2NaZgce8qmi+SuKFp6T1uWG_aZvPpvoJRkQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>
> This thread is a candidate for [0]
>
>
> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
>

Let me summarize this thread to make it easier to follow what's going on:

Timeline:
1) Andrey Borodin sends a patch (on 6 Aug) claiming there is
corruption in VM bits.
2) We investigate and find that the problem is not with how PostgreSQL
modifies buffers or logs changes, but with LWLockReleaseAll in
proc_exit(1) after kill -9 of the postmaster.
3) We reach the conclusion that there is no corruption, and that
injection points are not a valid way to reproduce it, because of
WaitLatch and friends.

4) But we now suspect there is another corruption, possible with ANY
critical section, in the following scenario:

I wrote:

> Maybe I'm very wrong about this, but I'm currently suspecting there is
> corruption involving CHECKPOINT, process in CRIT section and kill -9.
> 1) Some process p1 locks some buffer (name it buf1), enters a CRIT
> section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar
> in (GetXLogBuffer -> AdvanceXLInsertBuffer).
> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
> 3) Postmaster is killed with kill -9.
> 4) The signal of postmaster death is delivered to p1; it wakes up in
> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
> aliveness, and exits, releasing all locks.
> 5) p2 acquires the lock on buf1 and flushes it to disk.
> 6) The signal of postmaster death is delivered to p2; p2 exits.
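
To make the ordering in the steps above easier to follow, here is a toy
model of it in Python. This is only a sketch of the suspected interleaving,
NOT PostgreSQL code and NOT a reproducer: the Buffer and Cluster classes and
run_scenario() are hypothetical names made up for illustration, and the
model assumes that p1 has already modified the page before it blocks in
XLogInsert.

```python
# Toy model of the suspected ordering; NOT PostgreSQL code.
# Buffer, Cluster and run_scenario are hypothetical names.

class Buffer:
    def __init__(self):
        self.locked_by = None  # process currently holding the content lock
        self.dirty = False     # set by the modelled MarkBufferDirty
        self.page = "old"      # in-memory page contents


class Cluster:
    def __init__(self):
        self.buf1 = Buffer()
        self.wal = []                  # WAL records that were actually inserted
        self.disk = {"buf1": "old"}    # on-disk page contents


def run_scenario():
    c = Cluster()
    buf = c.buf1

    # 1) p1 locks buf1, enters a critical section, changes the page
    #    (assumption of this model), marks it dirty, then blocks inside
    #    XLogInsert, so no WAL record is appended to c.wal.
    buf.locked_by = "p1"
    buf.page = "new"
    buf.dirty = True

    # 2) The checkpointer p2 wants to flush buf1 but has to wait for the lock.
    p2_waiting = buf.locked_by is not None

    # 3) + 4) The postmaster is killed; p1 notices, exits, and releases
    #         all its locks without ever inserting the WAL record.
    buf.locked_by = None

    # 5) p2 now gets the lock and writes the dirty page to disk.
    if p2_waiting and buf.locked_by is None:
        buf.locked_by = "p2"
        if buf.dirty:
            c.disk["buf1"] = buf.page
            buf.dirty = False
        buf.locked_by = None

    # 6) p2 learns of the postmaster death and exits.
    return c


if __name__ == "__main__":
    c = run_scenario()
    print("on disk:", c.disk["buf1"], "| WAL records:", len(c.wal))
```

In this model the run ends with the changed page on disk and no WAL record
describing the change, which is the state we suspect is possible.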

5) We create an open item for pg18 and propose reverting
bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly.

Please note that the patches in this thread are NOT a reproducer of
corruption; as of today we have NO valid repro of corruption.

--
Best regards,
Kirill Reshke
