Re: Why clearing the VM doesn't require registering vm buffer in wal record

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Why clearing the VM doesn't require registering vm buffer in wal record
Date: 2026-03-12 17:05:20
Message-ID: t7xd6gjxdxpzdzr2uamfbyi5rlkdsikiywmtmgwrwcsr77xyw7@i3qhbzy5rpzx
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Which perhaps also should have emitted an FPI when clearing a bit? But I'm
> > unsure that that was required at the time. OTOH, it did seem to generate an
> > FPI for setting a VM bit, so ...
>
> After booting up the part of my brain that worked on this back in
> 2011, I think I reasoned that clearing a bit was idempotent and didn't
> depend on the prior page state, so that we would be adequate protected
> without an FPI. I had to add a WAL record to *set* the VM bit, because
> that wasn't logged at all, but the actual of clearing the VM bit
> piggybacked on the heap change that necessitated doing so, and having
> that record also emit an FPI for the VM page didn't seem to me to add
> anything.
>
> So I think technically it's the addition of checksums that turns this
> into a real bug, because now torn pages are a checksum-failure hazard.
> However, that would have been extremely hard to notice given that I
> seem never to have actually documented that reasoning in a comment
> anywhere.

And then it got a lot worse with incremental backup support...

> I don't think we should mess around with trying to make the behavior
> here conditional on wal_level, checksums, etc. We should just do it
> all the time to keep the logic simple.

Agreed.

I've been working on a fix for this (still pretty raw though). What a mess.

As part of validating that I added error checks that disallow reads in the
startup process for anything other than the FSM fork. Which triggered, due to
visibilitymap_prepare_truncate() reading a VM page.

Which in turn made me wonder: Is it actually OK for
visibilitymap_prepare_truncate() to do separate WAL logging from
XLOG_SMGR_TRUNCATE?

I suspect not, consider this scenario:

1) primary does XLOG_FPI for the VM page
2) checkpoint
3) primary does XLOG_SMGR_TRUNCATE
4) standby replays XLOG_FPI
5) primary performs a restartpoint
6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
visibilitymap_prepare_truncate, which dirties the VM page
7) standby crashes while writing out the VM page
8) standby does recovery and now finds a torn VM page, triggering a checksum
failure

Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
wipe out that region of the VM?

Separately, is it actually sufficient that visibilitymap_prepare_truncate()
only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth
in my head about danger scenarios involving PITR or crashes inbetween
prepare_truncate() and the actual truncation.

ISTM that we should not do visibilitymap_prepare_truncate() during recovery
and always WAL log the prepare_truncate() on the primary. Does that sound
sane?

Greetings,

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2026-03-12 17:15:39 Re: pg_plan_advice
Previous Message Robert Haas 2026-03-12 17:03:21 Re: Better shared data structure management and resizable shared data structures