From: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: VM corruption on standby |
Date: | 2025-08-12 05:38:04 |
Message-ID: | CALdSSPgo7=UrgNqJUhPdAimSXS-ZuHOOEbtOH__CH3NUS3G4_A@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> Hi hackers!
>
> I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
> At Yandex we had several incidents with corrupted VM, and at pgconf.dev colleagues from AWS confirmed that they had seen something similar too.
> So I toyed around and accidentally wrote a test that reproduces $subj.
>
> I think the corruption happens as follows:
> 0. we create a table with one frozen tuple
> 1. the next heap_insert() clears the VM bit and hangs immediately; nothing has been logged yet
> 2. the VM buffer is flushed to disk by the checkpointer or bgwriter
> 3. the primary is killed with -9
> now we have a page that is ALL_VISIBLE\ALL_FROZEN on the standby, but has clear VM bits on the primary
> 4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
> 5. pg_visibility detects corruption
>
> Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:
>
> /* Clear only the all-frozen bit on visibility map if needed */
> if (PageIsAllVisible(page) &&
>     visibilitymap_clear(relation, block, vmbuffer,
>                         VISIBILITYMAP_ALL_FROZEN))
>     cleared_all_frozen = true; // this won't happen due to flushed VM buffer before a crash
>
> Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.
>
> The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.
>
> Thanks!
>
>
> Best regards, Andrey Borodin.
>
>
>
The attached patch reproduces the same corruption, but without any standby
node. CHECKPOINT somehow manages to flush the heap page when the instance is
killed with kill -9.
As a result, we have an inconsistency between the heap and VM pages:
```
reshke=# select * from pg_visibility('x');
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | t           | t          | f
(1 row)
```
Note that I moved the injection point one line above visibilitymap_clear.
Without this change the behaviour still reproduces, but much less
frequently.
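
For anyone following along, here is a toy C model of the two states discussed
in this thread. It is only a sketch and not PostgreSQL code: every `Toy*` /
`toy_*` name is invented for illustration. The only thing taken from the real
code is the assumption that visibilitymap_clear() returns true only when a
requested bit was actually set.

```c
#include <stdbool.h>

/* Toy stand-ins for the VM bits; NOT the PostgreSQL definitions. */
#define TOY_VM_ALL_VISIBLE 0x01u
#define TOY_VM_ALL_FROZEN  0x02u

/* One heap block as seen by one node: page header flag plus its VM bits. */
typedef struct ToyBlock
{
    bool     pd_all_visible; /* models PD_ALL_VISIBLE on the heap page */
    unsigned vm_bits;        /* models the visibility map bits for the block */
} ToyBlock;

/*
 * Models the assumed contract of visibilitymap_clear(): clear the requested
 * bits and report whether any of them was set beforehand.
 */
static bool
toy_vm_clear(ToyBlock *b, unsigned flags)
{
    bool cleared = (b->vm_bits & flags) != 0;

    b->vm_bits &= ~flags;
    return cleared;
}

/*
 * Models the quoted condition: the "all-frozen cleared" flag is set only if
 * the page flag is set AND the VM bit was still set when we cleared it.
 */
static bool
toy_lock_clears_all_frozen(ToyBlock *b)
{
    return b->pd_all_visible && toy_vm_clear(b, TOY_VM_ALL_FROZEN);
}

/*
 * Invariant that a pg_visibility row is expected to satisfy: a VM bit may
 * be set only if the page flag is set, and all-frozen implies all-visible.
 */
static bool
toy_block_is_consistent(const ToyBlock *b)
{
    bool vis = (b->vm_bits & TOY_VM_ALL_VISIBLE) != 0;
    bool frz = (b->vm_bits & TOY_VM_ALL_FROZEN) != 0;

    if (frz && !vis)
        return false;
    if (vis && !b->pd_all_visible)
        return false;
    return true;
}
```

With `ToyBlock b = { true, 0 }` (page flag still set, VM bits already flushed
as clear), `toy_lock_clears_all_frozen(&b)` returns false, which is the
"this won't happen" case from the quoted comment. With
`ToyBlock b = { false, TOY_VM_ALL_VISIBLE | TOY_VM_ALL_FROZEN }`, which
matches the pg_visibility row above, `toy_block_is_consistent(&b)` returns
false.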
--
Best regards,
Kirill Reshke
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Corrupt-VM-on-standby.patch | application/octet-stream | 10.4 KB |