standby promotion can create unreadable WAL

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: standby promotion can create unreadable WAL
Date: 2022-08-22 18:36:36
Message-ID: CA+TgmoY0Lri=fCueg7m_2R_bSspUb1F8OFycEGaHNJw_EUW-=Q@mail.gmail.com
Lists: pgsql-hackers

My colleague Dilip Kumar and I have discovered what I believe to be a
bug in the recently-added "overwrite contrecord" stuff. I'm not sure
whether this bug has any serious consequences; I think there may be a
scenario where it does, but I'm not certain.

Suppose you have a primary and a standby, and the standby is promoted
after reading a partial WAL record. The attached script, which was
written by Dilip and slightly modified by me, creates this scenario by
setting up an archiving-only standby, writing a record that crosses a
segment boundary, and then promoting the standby. If you then try to
run pg_waldump on the WAL on timeline 2, it goes boom:

[rhaas pg_wal]$ pg_waldump 000000020000000000000004 000000020000000000000005 2>&1 | tail -n4
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFE7B0, prev 0/04FFDFF0, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/5/16384 blk 2132
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFEF58, prev 0/04FFE7B0, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/5/16384 blk 2133
rmgr: Heap len (rec/tot): 1959/ 1959, tx: 728, lsn: 0/04FFF700, prev 0/04FFEF58, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/5/16384 blk 2133
pg_waldump: error: error in WAL record at 0/4FFF700: invalid record length at 0/4FFFEA8: wanted 24, got 0

What's happening here is that the last WAL segment from timeline 1,
which is 000000010000000000000004, gets copied over to the new
timeline up to the point where the last complete record on that
timeline ends, namely, 0/4FFFEA8. I think that the first record on the
new timeline should be written starting at that LSN, but that's not
what happens. Instead, the rest of that WAL segment remains zeroed,
and the first WAL record on the new timeline is written at the
beginning of the next segment:

[rhaas pg_wal]$ pg_waldump 000000020000000000000005 2>&1 | head -n4
rmgr: XLOG len (rec/tot): 42/ 42, tx: 0, lsn: 0/05000028, prev 0/04FFF700, desc: OVERWRITE_CONTRECORD lsn 0/4FFFEA8; time 2022-08-22 13:49:22.874435 EDT
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/05000058, prev 0/05000028, desc: CHECKPOINT_SHUTDOWN redo 0/5000058; tli 2; prev tli 1; fpw true; xid 0:729; oid 24576; multi 1; offset 0; oldest xid 719 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG len (rec/tot): 30/ 30, tx: 0, lsn: 0/050000D0, prev 0/05000058, desc: NEXTOID 32768
rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 0/050000F0, prev 0/050000D0, desc: CREATE base/5/24576
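
(If it helps to see the arithmetic: assuming the default 16MB
wal_segment_size and the 40-byte long page header that starts each
segment, the numbers in those two dumps work out as in this little
standalone sketch, which is only illustrative.)

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t seg_size = 16 * 1024 * 1024;  /* default wal_segment_size */
    const uint64_t end_of_tli1 = 0x04FFFEA8;     /* 0/4FFFEA8, end of last complete record copied from TLI 1 */
    const uint64_t next_seg = ((end_of_tli1 / seg_size) + 1) * seg_size;  /* 0/5000000 */
    const uint64_t long_phd = 40;                /* long page header at a segment start */

    /* First record written on the new timeline (the OVERWRITE_CONTRECORD above). */
    printf("first TLI 2 record: 0/%08llX\n",
           (unsigned long long) (next_seg + long_phd));
    /* Tail of segment 4 that is left zeroed on timeline 2. */
    printf("zeroed gap: %llu bytes\n",
           (unsigned long long) (next_seg - end_of_tli1));
    return 0;
}

That prints 0/05000028 for the first record on timeline 2 and a
344-byte zeroed tail in segment 4, matching the dumps above.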

Nothing that uses xlogreader is going to be able to bridge the gap
between file #4 and file #5. In this case it doesn't matter very much,
because we immediately write a checkpoint record into file #5, so if
we crash we won't try to replay file #4 anyway. However, if anything
did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

I think that the cause of this problem is this code right here:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

It seems to me that this if-statement should also test that the TLI
has not changed, i.e. if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr)). If the TLI hasn't changed,
then everything the comment says is correct and I think that what the
code does is also correct. However, if the TLI *has* changed, then I
think we must not advance EndOfLog here, because the WAL that was
copied from the old timeline to the new timeline ends at the point in
the file corresponding to the value of EndOfLog just before executing
this code. When this code then moves EndOfLog forward to the beginning
of the next segment, it leaves the unused portion of the previous
segment as all zeroes, which creates the problem described above.
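
Concretely, I'm imagining something like this (untested, just to show
the shape of the change, using the variable names from the excerpt
above):

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted,
     * but only if we're staying on the same timeline.  If we switched
     * timelines, the old timeline's last segment was only copied up to the
     * end of the last complete record, so skipping ahead here would leave
     * an unreadable run of zeroes behind the first new record.
     */
    if (newTLI == endOfRecoveryInfo->lastRecTLI &&
        !XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }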

(Incidentally, there's also a bug in pg_waldump here: it's reporting
the wrong LSN as the source of the error. 0/4FFF700 is not the record
that's busted, as shown by the fact that it was successfully decoded
and shown in the output. The relevant code in pg_waldump should be
using EndRecPtr instead of ReadRecPtr to report the error. If it did,
it would complain about 0/4FFFEA8, which is where the problem really
is. This is of the same vintage as the bug fixed by
d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the
issue was reporting all errors using the start LSN of the first of
several records read no matter where the error actually happened,
whereas in this case the error message uses the start LSN of the
previous record instead of the current one.)
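
The corresponding pg_waldump change would be small, something along
these lines in the error path (sketch only, not an exact diff):

    /* in pg_waldump.c, when XLogReadRecord() has failed: */
    if (errormsg)
        pg_fatal("error in WAL record at %X/%X: %s",
                 /* EndRecPtr, not ReadRecPtr: report where decoding failed */
                 LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
                 errormsg);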

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
test_contrecord.sh text/x-sh 1.6 KB
