[bug fix] Cascaded standby cannot start after a clean shutdown

From: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: [bug fix] Cascaded standby cannot start after a clean shutdown
Date: 2018-02-14 04:37:05
Message-ID: 0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

Our customer encountered a rare bug of PostgreSQL which prevents a cascaded standby from starting up. The attached patch is a fix for it. I hope this will be back-patched. I'll add this to the next CF.

PROBLEM
==============================

The PostgreSQL version is 9.5. The cluster consists of a master, a cascading standby (SB1), and a cascaded standby (SB2). The WAL flows like this: master -> SB1 -> SB2.

The user shut down SB2 and tried to restart it, but failed with the following message:

FATAL: invalid memory alloc request size 3075129344

The master and SB1 continued to run.

CAUSE
==============================

total_len in the code below was about 3GB, so palloc() rejected the memory allocation request.

[xlogreader.c]
record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
total_len = record->xl_tot_len;
...
/*
* Enlarge readRecordBuf as needed.
*/
if (total_len > state->readRecordBufSize &&
!allocate_recordbuf(state, total_len))
{

Why was XLogRecord->xl_tot_len so big? That's because the XLOG reader read the garbage portion in a WAL file, which is immediately after the end of valid WAL records.

Why was there the garbage? Because the cascading standby sends just the valid WAL records, not the whole WAL blocks, to the cascaded standby. When the cascaded standby receives those WAL records and write them in a recycled WAL file, the last WAL block in the file contains the mix of valid WAL records and the garbage left over.

Why did the XLOG reader try to allocate memory for a garbage? Doesn't it stop reading the WAL? As the following code shows, when the WAL record header crosses a WAL block boundary, the XLOG reader first allocates memory for the whole record, and then checks the validity of the record header as soon as it reads the entire header.

[xlogreader.c]
/*
* If the whole record header is on this page, validate it immediately.
* Otherwise do just a basic sanity check on xl_tot_len, and validate the
* rest of the header after reading it from the next page. The xl_tot_len
* check is necessary here to ensure that we enter the "Need to reassemble
* record" code path below; otherwise we might fail to apply
* ValidXLogRecordHeader at all.
*/
if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
{
if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
randAccess))
goto err;
gotheader = true;
}
else
{
/* XXX: more validation should be done here */
if (total_len < SizeOfXLogRecord)
{
report_invalid_record(state,
"invalid record length at %X/%X: wanted %u, got %u",
(uint32) (RecPtr >> 32), (uint32) RecPtr,
(uint32) SizeOfXLogRecord, total_len);
goto err;
}
gotheader = false;
}

FIX
==============================

One idea is to defer the memory allocation until the entire record header is read and ValidXLogRecordHeader() is called. However, ValidXLogRecordHeader() might misjudge the validity if the garbage contains xl_prev seeming correct.

So, I modified walreceiver to zero-fill the last partial WAL block. This is the guard that the master does in AdvanceXLInsertBuffer() as below:

/*
* Be sure to re-zero the buffer so that bytes beyond what we've
* written will look like zeroes and not valid XLOG records...
*/
MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

FYI, the following unsolved problem may be solved, too.

https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Regards
Takayuki Tsunakawa

Attachment Content-Type Size
zerofill_walblock_on_standby.patch application/octet-stream 2.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-02-14 04:42:08 Re: CALL stmt, ERROR: unrecognized node type: 113 bug
Previous Message Michael Paquier 2018-02-14 04:24:43 Re: TODO item: WAL replay of CREATE TABLESPACE with differing directory structure