Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] Cascaded standby cannot start after a clean shutdown
Date: 2018-02-22 07:55:38
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 19, 2018 at 03:01:15AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael(at)paquier(dot)xyz]

Sorry for my late reply. I was looking at this problem for the last
couple of days here and there, still thinking about it.

>> It seems to me that the consolidation of the page read should happen directly
>> in xlogreader.c and not even in one of its callbacks so as no garbage data
>> is presented back to the caller using its own XLogReader.
>> I think that you need to initialize XLogReaderState->readBuf directly in
>> ReadPageInternal() before reading a page and you should be good. With your
>> patch you get visibly only one portion of things addressed, what of other
>> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?
> ReadPageInternal() doesn't know where the end of valid WAL is, so it
> cannot determine where to do memset(0). For example, in non-streaming
> cases, it reads the entire WAL block into readbuf, including the
> garbage.

Why couldn't it know about it? It would be perfectly fine to feed to it
the end LSN position as well and it is an internal API to xlogreader.c.
Note that both its callers, XLogFindNextRecord or XLogReadRecord know
about that as well.

> /*
> * If the current segment is being streamed from master, calculate how
> * much of the current page we have received already. We know the
> * requested record has been received, but this is for the benefit of
> * future calls, to allow quick exit at the top of this function.
> */
> if (readSource == XLOG_FROM_STREAM)
> {
> if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
> readLen = XLOG_BLCKSZ;
> else
> readLen = receivedUpto % XLogSegSize - targetPageOff;
> }
> else
> readLen = XLOG_BLCKSZ;
> So we have to zero-fill the empty space of a WAL block before writing
> it. Currently, the master does that in AdvanceXLInsertBuffer(),
> StartupXLOG(), XLogFileCopy(). The cascading standby receives WAL
> from the master block by block, so it doesn't suffer from the garbage.
> pg_receivexlog zero-fills a new WAL file.

I cannot completely parse this statement. Physical replication sends up
to 16 WAL pages per message, rounded down to the last page completed.
Even a cascading standby sends WAL following this protocol, using the
last flush position as a base for the maximum amount of data sent.

>> + if (zbuffer == NULL)
>> + zbuffer = palloc0(XLOG_BLCKSZ);
>> You could just use a static buffer which is MemSet'd with 0 only if necessary.
> I wanted to allocate memory only when necessary. Currently, only the
> cascaded standby needs this. To that end, I moved the memory
> allocation code to a right place. Thanks for letting me notice this.

Still, it seems to me that any problems are not related to the fact that
we are using cascading standbys. As [1] says as well, similar problems
have been reported on a standby after restarting its primary.

I am definitely ready to buy that it can be possible to have garbage
being read the length field which can cause allocate_recordbuf to fail
as that's the only code path in xlogreader.c which does such an
allocation. Still, it seems to me that we should first try to see if
there are strange allocation patterns that happen and see if it is
possible to have a reproduceable test case or a pattern which gives us
confidence that we are on the right track. One idea I have to
monitor those allocations like the following:
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

+#ifndef FRONTEND
+ elog(LOG, "Allocation for xlogreader increased to %u", newSize);

This could be upgraded to a PANIC or such if it sees a larger
allocation, as pgbench generates records with known lengths, that would
be a good fit to see if there is some garbage being read. No need to go
up to 1GB before seeing a failure.


In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2018-02-22 08:14:00 Re: Online enabling of checksums
Previous Message Fabien COELHO 2018-02-22 07:50:31 Re: PATCH: pgbench - break out timing data for initialization phases