Two issues leading to discrepancies in FSM data on the standby server

From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Two issues leading to discrepancies in FSM data on the standby server
Date: 2026-03-20 01:32:20
Message-ID: 596c4f1c-f966-4512-b9c9-dd8fbcaf0928@postgrespro.ru
Lists: pgsql-hackers

We’ve recently observed a significant increase in response time for
insert operations after switching over to a replica server. The
collected information pointed to a discrepancy in the FSM data on the
replica side, which became visible to the insert sessions once the
autovacuum process pulled incorrect data from the leaf blocks into the
FSM root. The situation looked like the case discussed in
https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql,
which was supposed to be fixed by ‘ab7dbd681’ (which introduced an FSM
update during 'heap_xlog_visible' invocation). However, both in our case
and in synthetic tests we were able to see data blocks marked as ‘all
visible’ but still having incorrect FSM records.

After analyzing the code I’ve noticed that during recovery FSM data is
updated in XLogRecordPageWithFreeSpace, which uses MarkBufferDirtyHint
to mark the FSM block as modified. However, if data checksums are
enabled, this call is effectively a no-op during recovery: it just exits
immediately without marking the block as dirty. The logic here is that,
since no new WAL can be generated during recovery, changes to hints in a
block should not mark the block as dirty, to avoid the risk of torn
pages being written. This seems logical, but it does not fit the FSM
case well, because an FSM block can simply be zeroed if a checksum
mismatch is detected. Currently, changes to an FSM block can be lost if
changes to that particular block occur rarely enough to allow its
eviction from the cache in between. To be persisted, a modification
needs to be performed while the FSM block is still kept in buffers and
still marked as dirty after receiving its FPI. If the block has already
been cleaned, the change won’t be persisted and the stored FSM blocks
may remain in an obsolete state. In our case the table had its
'fillfactor' parameter set below 80, so during insert bursts each FSM
block on the replica side was modified only on the first access to that
FSM block since the checkpoint (with an FPI) and then when processing
the XLOG_HEAP2_VISIBLE record for a data block once it was marked as
‘all visible’. This leaves plenty of time for the buffer to be cleaned
between these two moments, so the second change was simply never written
to disk. As a result, a large number of FSM leaf blocks were left with
incorrect data, which caused the problem after switchover.

Given that the FSM is able to handle torn page writes and that
XLogRecordPageWithFreeSpace is called only during recovery, there seems
to be no reason to use MarkBufferDirtyHint here instead of a regular
MarkBufferDirty call. The code already tries to limit updates to the FSM
(e.g. by updating it only after reaching 80% of used space for regular
DML), so we probably want to ensure that these updates are actually
persisted.

The second issue (not related to our observed problem) concerns
‘heap_xlog_visible’: this function calls ‘PageGetFreeSpace’ instead of
‘PageGetHeapFreeSpace’ to get the amount of free space in regular heap
blocks. This looks like a bug, as 'PageGetHeapFreeSpace' is used in
every other place where we need the free space of a heap page. Using the
wrong function can also cause incorrect data to be written to the FSM on
the replica: if a block still has free space but has already reached the
MaxHeapTuplesPerPage limit, it should be marked in the FSM as
unavailable for new rows; otherwise the inserter will need to check the
block and update its FSM data itself.

Attached are separate patches that try to fix both of these problems:
calling ‘MarkBufferDirty’ instead of ‘MarkBufferDirtyHint’ in the first
case and replacing ‘PageGetFreeSpace’ with ‘PageGetHeapFreeSpace’ in the
second case.

Two synthetic test cases that simulate these situations are also
attached: ‘test_case1.zip’ simulates the problem with the lost FSM
update on the replica side, and ‘test_case2.zip’ simulates incorrect FSM
data on the standby server for blocks with a large number of redirect
slots. In both cases the ‘test_prepare.sh’ script can be edited to
specify the path to the PG installation and the port numbers. Then
invoke the ‘test_prepare.sh’ script to prepare two databases. For the
first case, the second script ‘test_run.sh’ needs to be invoked after
that to show the large number of blocks being visited for a simple
insert. For the second case, the state of the FSM (for a single block)
is simply displayed at the end of ‘test_prepare.sh’.

Thanks,
Alexey

Attachment                                                    Content-Type     Size
0001-Mark-modified-FSM-buffer-as-dirty-during-recovery.patch  text/x-patch     3.3 KB
0002-Use-PageGetHeapFreeSpace-in-heap_xlog_visible.patch      text/x-patch     1.2 KB
test_case1.zip                                                application/zip  5.6 KB
test_case2.zip                                                application/zip  4.4 KB
