[PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master

From: "Palamadai, Eka" <ekanatha(at)amazon(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Replica sends an incorrect epoch in its hot standby feedback to the Master
Date: 2020-02-06 23:51:55
Message-ID: 98BB4805-D0A2-48E1-96F4-15014313EADC@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Postgres hackers,
The below problem occurs in Postgres versions 11, 10, and 9.6. However, it doesn’t occur since Postgres version 12, since the commit [6] to add basic infrastructure for 64-bit transaction IDs indirectly fixed it.
Problem
The replica sends an incorrect epoch in its hot standby feedback to the master in the scenario outlined below, where a checkpoint is interleaved with the execution of 2 transactions at the master. The incorrect epoch in the feedback causes the master to ignore the “oldest Xmin” X sent by the replica. If a heap page prune[1] or vacuum were executed at the master immediately thereafter, they may use a newer “oldest Xmin” Y > X, and prematurely delete a tuple T such that X < t_xmax (T) < Y, which is still in use at the replica as part of a long running read query Q. Subsequently, when the replica replays the deletion of T as part of its WAL replay, it cancels the long running query Q causing unnecessary pain to customers.

Scenario
The experimental setup is to start 2 write transactions and get their transaction ids, then create a checkpoint, and then have the oldest of the 2 transactions perform a write operation. The WAL replay of that write operation breaks the monotonically increasing property of “nextXid” tracked by the replica, and causes it to send a wrong epoch in its next feedback.

1. Set the following parameters on the replica and the master.
* replica:

i. set hot_standby = on and hot_standby_feedback = on to provide hot standby feedback.

ii. set log_min_messages = DEBUG2 so that hot standby feedback[2] gets logged.

* master:

i. set checkpoint_timeout = 3600 (1 hour) to turn off automatic checkpoints during the experiment.

ii. set autovacuum = off to avoid interference from the vacuum process.

1. Make sure no other read/write queries get executed on the replica and the master during this experiment, since they might change the nextXid on the replica. To achieve this, I set up a master and replica running Postgres 11 “locally” on my desktop, with the above parameter settings.
2. Start a psql client to the master, and do the following:

a. Create a test table, say “test_epoch_table”. Execute “CREATE TABLE test_epoch_table(id INT, description CHAR(200));”.

b. Begin a transaction, say A, and get the current transaction ID. Execute “begin; select txid_current();”.

1. Start a second psql client to the master. Begin a transaction, say B, and get the current transaction ID. Execute “begin; select txid_current();”.
2. Start a third psql client to the master, and create a checkpoint. Execute “checkpoint;”.
3. From transaction A, insert a tuple into “test_epoch_table”. Execute “insert into test_epoch_table(id, description) values(1, 'one');”
4. Open the Postgres log file on the replica and look for the latest hot standby feedback. The log message will show that the replica sent an incorrect epoch of 1, instead of 0.

Analysis
The variable “ShmemVariableCache->nextXid” (or “nextXid” for short) should be monotonically increasing unless it wraps around to the next epoch. However, in the above sequence, this property is violated on the replica in the function “RecordKnownAssignedTransactionIds”[3], when the WAL replay for the insertion at step 6 is executed at the replica.

For example, before step 3b is executed, suppose the nextXid at the replica and master is 100, and that the variable “latestObservedXid”, which tracks the latest observed Xid at the replica, is 99. Now, suppose the Xid for transaction A is 100, and that for transaction B is 101. After step 4 is executed, the latestObservedXid at the replica still remains at 99, since the WAL replay for RUNNING_XACTS doesn’t advance latestObservedXid, once the replica reaches the STANDBY_SNAPSHOT_READY state. See [4] for reference. After step 5 is executed, the WAL replay of the checkpoint advances nextXid at the replica to 102, but doesn’t invoke RecordKnownAssignedTransactionIds, and hence latestObservedXid still remains 99. After step 6 is executed, the WAL replay of the insertion invokes RecordKnownAssignedTransactionIds with an input Xid of 100, advances latestObservedXid to 100, and sets nextXid at the replica to 101 (at line [3]) breaking its monotonicity. The hot standby feedback, which is generated immediately after the WAL replay of the insertion, invokes the function “GetNextXidAndEpoch”, which incorrectly calculates the epoch [5].

Versions impacted by this problem: 11, 10, and 9.6

Attached is the git diff of the patch for version 11, which is a 1-line change excluding the curly braces. Patch for versions 10 and 9.6 involve the same change.

Please let me know if you have any questions.

Thanks,
Eka Palamadai
Amazon Web Services

[1] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/heap/pruneheap.c#L74
[2] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/replication/walreceiver.c#L1244
[3] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L3259
[4] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/ipc/procarray.c#L693
[5] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/access/transam/xlog.c#L8444
[6] https://github.com/postgres/postgres/commit/2fc7af5e966043a412e8e69c135fae55a2db6d4f

Attachment Content-Type Size
git_diff application/octet-stream 642 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2020-02-07 00:00:00 typedef SegmentNumber
Previous Message Thomas Munro 2020-02-06 23:38:27 Getting rid of some more lseek() calls