Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] Cascaded standby cannot start after a clean shutdown
Date: 2018-02-16 07:19:00
Message-ID: 20180216071900.GE1174@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 14, 2018 at 04:37:05AM +0000, Tsunakawa, Takayuki wrote:
> The PostgreSQL version is 9.5. The cluster consists of a master, a
> cascading standby (SB1), and a cascaded standby (SB2). The WAL flows
> like this: master -> SB1 -> SB2.
>
> The user shut down SB2 and tried to restart it, but failed with the
> following message:
>
> FATAL: invalid memory alloc request size 3075129344
>
> The master and SB1 continued to run.

This matches a couple of recent bug reports we have seen around like
this one:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
I recalled as well this one but in this case the user shot his own foot
with the failover scenario he designed:
https://www.postgresql.org/message-id/CAAc9rOyKAyzipiq7ee1%3DVbcRy2fRqV_hRujLbC0mrBkL07%3D7wQ%40mail.gmail.com

> CAUSE
> ==============================
>
> total_len in the code below was about 3GB, so palloc() rejected the
> memory allocation request.

Yes, it is limited to 1GB.

> [xlogreader.c]
> record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
> total_len = record->xl_tot_len;
> ...
> /*
> * Enlarge readRecordBuf as needed.
> */
> if (total_len > state->readRecordBufSize &&
> !allocate_recordbuf(state, total_len))
> {
>
> Why was XLogRecord->xl_tot_len so big? That's because the XLOG reader
> read the garbage portion in a WAL file, which is immediately after the
> end of valid WAL records.
>
> Why was there the garbage? Because the cascading standby sends just
> the valid WAL records, not the whole WAL blocks, to the cascaded
> standby. When the cascaded standby receives those WAL records and
> write them in a recycled WAL file, the last WAL block in the file
> contains the mix of valid WAL records and the garbage left over.

Wait a minute here, when recycled past WAL segments would be filled with
zeros before being used.

> Why did the XLOG reader try to allocate memory for a garbage? Doesn't
> it stop reading the WAL? As the following code shows, when the WAL
> record header crosses a WAL block boundary, the XLOG reader first
> allocates memory for the whole record, and then checks the validity of
> the record header as soon as it reads the entire header.
>
> [...]
>
> FIX
> ==============================
>
> One idea is to defer the memory allocation until the entire record
> header is read and ValidXLogRecordHeader() is called. However,
> ValidXLogRecordHeader() might misjudge the validity if the garbage
> contains xl_prev seeming correct.

It seems to me that the consolidation of the page read should happen
directly in xlogreader.c and not even in one of its callbacks so as no
garbage data is presented back to the caller using its own XLogReader.
I think that you need to initialize XLogReaderState->readBuf directly in
ReadPageInternal() before reading a page and you should be good. With
your patch you get visibly only one portion of things addressed, what of
other code paths using xlogreader.c's APIs like pg_rewind, 2PC code and
such?

> FYI, the following unsolved problem may be solved, too.
>
> https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Yeah, I noticed this one too before going through your message in
details ;)

Please note that your patch has some unnecessary garbage in two places:
- * Portions Copyright (c) 2010-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2010-2017, PostgreSQL Global Development Group
[...]
- * Only superusers and members of pg_read_all_stats can see details.
- * Other users only get the pid value
+ * Only superusers can see details. Other users only get the pid valu

You may want to indent your patch as well before sending it.

+ if (zbuffer == NULL)
+ zbuffer = palloc0(XLOG_BLCKSZ);
You could just use a static buffer which is MemSet'd with 0 only if
necessary.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tsunakawa, Takayuki 2018-02-16 07:28:40 [bug fix] Produce a crash dump before main() on Windows
Previous Message Michael Paquier 2018-02-16 06:54:04 Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()