Re: Two issues leading to discrepancies in FSM data on the standby server

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Two issues leading to discrepancies in FSM data on the standby server
Date: 2026-04-22 14:15:23
Message-ID: CAAKRu_ZNAsUv9TPJ0bxxfTj-KyTMVidqWUhsfjKDmUrA4DRSkQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 22, 2026 at 8:32 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Tue, Apr 21, 2026 at 5:42 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
>
> > > I also think that usage of MarkBufferDirty() here is safe. If I
> > > understood correctly.
> > > 1) When wal_log_hints = on, should be completely safe. Even if we
> > > have torn page after the crash, during recovery FPI from the primary
> > > should come first.
> >
> > I think FPIs from primary don't really matter here, since we are only
> > talking about MarkBufferDirty() in XLogRecordPageWithFreespace(). If
> > we change it to MarkBufferDirty() on the standby and the machine
> > crashes mid-write leading to a checksum error, we'll just zero it out
> > -- which is really your point 2. While FPIs from the primary will
> > overwrite the standby's FSM page, they don't provide torn-page
> > protection for modifications made by the standby as you could read the
> > page between the torn write and replaying any FPI from the primary.
>
> It's probably not so important in this context, but I'd like to verify
> my thoughts further. My idea is that standby's changes of FSM are
> mirroring primary's changes of FSM, even that FSM changes don't have
> own WAL-records and being decoded from other WAL records. Thus, if
> some FSM page on primary gets changed then primary emits FPI for the
> first change after checkpoint. The standby restartpoints are
> synchronized with primary's checkpoints, and the FSM changes mirrors
> FSM changes on primary. Standby should also have its first change of
> FSM page after the restartpoint covered by FPI received from primary.
> So, the consistency of FSM pages should be guaranteed in the similar
> way to every other WAL-logged pages, except FSM pages are not directly
> WAL-logged, but got their changes decoded from main fork WAL-records.
>
> The weak point I see in the reasoning above is the assumption that FSM
> changes on standby fully mirrors FSM changes on primary. I didn't
> really check this invariant. But other than that, could you please,
> re-check this thoughts and let me know what do you think?

Yea, I don't think they are really being mirrored. The conditions for
updating the FSM in normal operation vs in recovery are different.
When doing an INSERT, we can look for freespace on heap page 5, find
it doesn't have enough and then update the freespace map to reflect
that. Then, let's say we eventually find space for the INSERTed tuple
on heap page 10000. In normal operation, we update the FSM on
rejection, not when we find freespace. So we do not update the FSM for
heap page 10000 to reflect how much freespace it now has remaining.
Later, when replaying the INSERT, if the heuristic conditions are met
(action == BLK_NEEDS_REDO && freespace < BLCKSZ / 5) for freespace on
heap block 10000, we may then update the FSM for page 10000. The local
changes to the FSM page don't mark the buffer dirty and thus aren't
safe from loss if the page is evicted and definitely not safe from
being wiped out by a future FPI since they don't update the FSM page
LSN.

> > It is a bug that is causing overly optimistic FSM
> > numbers, but it's not a correctness issue like wrong results/data
> > corruption etc. So, I think you could make an argument either way
> > about fixing it.
>
> It has user-visible effect of increased insertion time after replica
> promotion. I think this is quite a reason to backpatch.

Right, since it's not invasive, I'm fine with backpatching.

> > I don't know how much of Alexey's reported issue was this vs
> > PageGetFreeSpace() in heap_xlog_visible(). The MarkBufferDirty()
> > change is easy to fix, so it probably makes sense to do so. I haven't
> > investigated more about the PageGetFreeSpace() issue.
>
> Makes sense. I suggest Alexey could clarify this.

Would be good also if he could add a 19 open item for the
PageGetFreeSpace() thing (in the issues affecting back branches
section) or register a patch for next commitfest (as a bug), so we
don't forget about it.

- Melanie

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabrízio de Royes Mello 2026-04-22 14:21:56 Re: RFC: pg_stat_logmsg
Previous Message Yura Sokolov 2026-04-22 13:13:21 Re: PoC: Add condition variable support to WaitEventSetWait()