From ea074ecb20e88dab91c08dc8aa77eb268de10585 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev <michail.nikolaev@gmail.com>
Date: Sun, 23 Jan 2022 20:47:56 +0300
Subject: [PATCH v9 3/3] docs

---
 src/backend/access/nbtree/README | 35 ++++++++++++++++++++++----------
 src/backend/storage/page/README  |  8 +++++---
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 5529afc1fe..a52936cea4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -734,17 +734,30 @@ lax about how same-level locks are acquired during recovery (most kinds
 of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
-During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
-on the standby server can be older than the oldest xmin on the primary
-server, which means tuples can be marked LP_DEAD even when they are
-still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
-they can still appear in the standby because of full page writes. So
-we must always ignore them in standby, and that means it's not worth
-setting them either.  (When LP_DEAD-marked tuples are eventually deleted
-on the primary, the deletion is WAL-logged.  Queries that run on a
-standby therefore get much of the benefit of any LP_DEAD setting that
-takes place on the primary.)
+Using LP_DEAD bits during recovery involves some complexity. Generally, a
+scan on the standby can both set and read the bits, but it may also encounter
+bits that were set on the primary. We don't WAL log tuple LP_DEAD bits, but
+they can still appear on the standby because of full-page writes. Such bits
+could cause MVCC failures because the oldest xmin on the standby
+server can be older than the oldest xmin on the primary server, which means
+tuples can be marked LP_DEAD even when they are still visible on the standby.
+
+To prevent such failures, pages whose LP_DEAD bits were set by the standby are
+marked with a special flag. The flag is always cleared when a full-page write
+from the primary is applied, so LP_DEAD bits received from the primary are
+ignored on the standby. Also, the standby clears all LP_DEAD bits set by the
+primary on a page before setting its own bits.
+
+The standby's setting of LP_DEAD bits is restricted by the minRecoveryPoint
+value. After a crash, the standby starts to process queries once it has
+replayed WAL up to minRecoveryPoint (in effect, a rewind to an earlier state).
+At the same time, setting LP_DEAD bits is not protected by WAL in any way. So,
+to mark a tuple as dead we must be sure it was "killed" before minRecoveryPoint
+(by comparing the LSN of the commit record). Another valid option is to compare
+the "killer" LSN with the index page LSN, because minRecoveryPoint is moved
+forward when the index page is flushed. Also, in some cases the xid of the
+"killer" is unknown, for example when tuples were cleared by XLOG_HEAP2_PRUNE.
+In that case, we compare the LSN of the heap page to the index page LSN.
 
 Note that we talk about scans that are started during recovery. We go to
 a little trouble to allow a scan to start during recovery and end during
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
index e30d7ac59a..1fd0cb29cb 100644
--- a/src/backend/storage/page/README
+++ b/src/backend/storage/page/README
@@ -59,6 +59,8 @@ even if it is a very bad thing for the user.
 New WAL records cannot be written during recovery, so hint bits set during
 recovery must not dirty the page if the buffer is not already dirty, when
 checksums are enabled.  Systems in Hot-Standby mode may benefit from hint bits
-being set, but with checksums enabled, a page cannot be dirtied after setting a
-hint bit (due to the torn page risk). So, it must wait for full-page images
-containing the hint bit updates to arrive from the primary.
+being set, but with checksums enabled, a page cannot be dirtied by setting a
+hint bit (due to the torn page risk). So, it must wait for full-page images
+containing the hint bit updates to arrive from the primary. However, if the
+page is already dirty, or is dirtied later by WAL replay, the hint bits may be
+flushed on the standby, so checksums on the primary and standby can differ.
-- 
2.33.1