Re: Two issues leading to discrepancies in FSM data on the standby server

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Two issues leading to discrepancies in FSM data on the standby server
Date: 2026-04-22 12:32:23
Message-ID: CAPpHfdup6dChoejUk0TLkyzy9SVKPFMTUhUXwphP=GsXg-pGUw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 21, 2026 at 5:42 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Tue, Apr 21, 2026 at 9:49 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >
> > On Tue, Apr 14, 2026 at 7:22 PM Melanie Plageman
> >
> > > Yea, I agree that this seems like simply an oversight in 96ef3b8. And
> > > it seems safe to use MarkBufferDirty() here instead.
> >
> > I also think that usage of MarkBufferDirty() here is safe. If I
> > understood correctly.
> > 1) When wal_log_hints = on, should be completely safe. Even if we
> > have torn page after the crash, during recovery FPI from the primary
> > should come first.
>
> I think FPIs from primary don't really matter here, since we are only
> talking about MarkBufferDirty() in XLogRecordPageWithFreespace(). If
> we change it to MarkBufferDirty() on the standby and the machine
> crashes mid-write leading to a checksum error, we'll just zero it out
> -- which is really your point 2. While FPIs from the primary will
> overwrite the standby's FSM page, they don't provide torn-page
> protection for modifications made by the standby as you could read the
> page between the torn write and replaying any FPI from the primary.

It's probably not that important in this context, but I'd like to
verify my reasoning further. My idea is that the standby's FSM changes
mirror the primary's FSM changes, even though FSM changes don't have
their own WAL records and are instead derived from other WAL records.
Thus, if some FSM page on the primary gets changed, the primary emits
an FPI for the first change after a checkpoint. The standby's
restartpoints are synchronized with the primary's checkpoints, and the
standby's FSM changes mirror those on the primary. So the standby's
first change to an FSM page after a restartpoint should also be covered
by an FPI received from the primary. Hence the consistency of FSM pages
should be guaranteed in a way similar to any other WAL-logged page,
except that FSM pages are not WAL-logged directly; their changes are
derived from main-fork WAL records.

The weak point I see in the reasoning above is the assumption that FSM
changes on the standby fully mirror FSM changes on the primary. I
haven't actually checked this invariant. But other than that, could you
please re-check this reasoning and let me know what you think?
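
To make the rule I'm relying on concrete, here is a toy C model of
"the first change to a page after a checkpoint carries an FPI". It is
only a sketch, not PostgreSQL code: ToyPage, ToyLsn, toy_needs_fpi()
and toy_apply_change() are hypothetical names, and the model assumes
the page's LSN is compared against the checkpoint's redo pointer, as
is done for ordinary WAL-logged pages. Whether FSM pages on the standby
actually follow this pattern is exactly the unverified assumption
mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for an LSN and a page header; not PostgreSQL types. */
typedef uint64_t ToyLsn;

typedef struct ToyPage
{
    ToyLsn lsn;     /* LSN of the last WAL record that touched the page */
} ToyPage;

/*
 * A page needs a full-page image if it has not been modified since the
 * last checkpoint, i.e. its LSN is not newer than the redo pointer.
 */
static bool
toy_needs_fpi(const ToyPage *page, ToyLsn redo_ptr)
{
    return page->lsn <= redo_ptr;
}

/*
 * Apply a change logged at record_lsn and report whether that change
 * had to carry a full-page image.
 */
static bool
toy_apply_change(ToyPage *page, ToyLsn redo_ptr, ToyLsn record_lsn)
{
    bool fpi;

    /* New records are always written after the current redo pointer. */
    assert(record_lsn > redo_ptr);

    fpi = toy_needs_fpi(page, redo_ptr);
    page->lsn = record_lsn;
    return fpi;
}
```

In this model, only the first change after each checkpoint reports an
FPI; later changes in the same checkpoint cycle do not.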

> > 2) When wal_log_hints = off, we can end up with torn pages not covered
> > by FPI. Without checksums, FSM can tolerate torn pages. With
> > checksums, that would result in zeroed pages. FSM can tolerate that
> > as well. And the last shouldn't happen too frequently. So, we should
> > finally get way better FSM state than it is now.
>
> Yes, I think the bottom line is that we can't get checksum errors
> reading FSM pages because of ZERO_ON_ERROR, so there is no reason to
> do MarkBufferDirtyHint() in recovery for FSM. It only leads to losing
> changes to the page.
>
> > Should we push it to all supported branches?
>
> I haven't looked at the code paths in previous versions, but as long
> as they are reading FSM pages with RBM_ZERO_ON_ERROR, I think it is
> safe to do so.

I've checked that since 96ef3b8 we only read FSM pages with RBM_ZERO_ON_ERROR.
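
For illustration, here is a simplified C model of why
MarkBufferDirtyHint() loses FSM changes during recovery while
MarkBufferDirty() does not. It is a sketch under assumptions, not the
bufmgr code: ToyBuffer, ToyState and all toy_* functions are
hypothetical, and the model assumes only that a hint-style dirtying is
skipped when it would need WAL (checksums or wal_log_hints enabled)
and the server is in recovery.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy stand-in for a shared buffer holding one FSM page. */
typedef struct ToyBuffer
{
    unsigned char page[TOY_PAGE_SIZE];
    bool          dirty;
} ToyBuffer;

/* Toy server state; field names are illustrative only. */
typedef struct ToyState
{
    bool in_recovery;       /* true on a standby replaying WAL */
    bool hint_wal_needed;   /* checksums or wal_log_hints enabled */
} ToyState;

/* Models MarkBufferDirty(): the page is always scheduled for write-out. */
static void
toy_mark_dirty(ToyBuffer *buf)
{
    buf->dirty = true;
}

/*
 * Models the relevant part of MarkBufferDirtyHint(): if the change would
 * need WAL but WAL cannot be written (recovery), leave the buffer clean.
 * On a primary the real function would also log an FPI; omitted here.
 */
static void
toy_mark_dirty_hint(const ToyState *st, ToyBuffer *buf)
{
    if (st->hint_wal_needed && st->in_recovery)
        return;
    buf->dirty = true;
}

/* Evict the buffer: write it out only if dirty, then re-read from disk. */
static void
toy_evict(ToyBuffer *buf, unsigned char *disk)
{
    assert(buf != NULL && disk != NULL);
    if (buf->dirty)
        memcpy(disk, buf->page, TOY_PAGE_SIZE);
    memcpy(buf->page, disk, TOY_PAGE_SIZE);
    buf->dirty = false;
}

/* Update one free-space byte, evict, and return the value seen afterwards. */
static unsigned char
toy_update_and_evict(const ToyState *st, bool use_hint, unsigned char value)
{
    unsigned char disk[TOY_PAGE_SIZE] = {0};
    ToyBuffer     buf = {{0}, false};

    buf.page[0] = value;
    if (use_hint)
        toy_mark_dirty_hint(st, &buf);
    else
        toy_mark_dirty(&buf);
    toy_evict(&buf, disk);
    return buf.page[0];
}
```

With in_recovery and hint_wal_needed both set, the hint path returns the
old value after eviction, while the MarkBufferDirty()-style path keeps
the update.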

> It is a bug that is causing overly optimistic FSM
> numbers, but it's not a correctness issue like wrong results/data
> corruption etc. So, I think you could make an argument either way
> about fixing it.

It has the user-visible effect of increased insertion time after
replica promotion. I think that is reason enough to backpatch.

> I don't know how much of Alexey's reported issue was this vs
> PageGetFreeSpace() in heap_xlog_visible(). The MarkBufferDirty()
> change is easy to fix, so it probably makes sense to do so. I haven't
> investigated more about the PageGetFreeSpace() issue.

Makes sense. Perhaps Alexey could clarify this.

------
Regards,
Alexander Korotkov
Supabase
