Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
Date: 2013-01-17 22:07:28
Message-ID: 20130117220728.GA30799@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote:
> On 17.01.2013 21:57, Heikki Linnakangas wrote:
> >On 17.01.2013 20:08, Andres Freund wrote:
> >>On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:
> >>>I encountered the problem that the timeline switch is not performed
> >>>expectedly.
> >>>I set up one master, one standby and one cascade standby. All the
> >>>servers
> >>>share the archive directory. restore_command is specified in the
> >>>recovery.conf
> >>>in those two standbys.
> >>>
> >>>I shut down the master, and then promoted the standby. In this case, the
> >>>cascade standby should switch to new timeline and replication should be
> >>>successfully restarted. But the timeline was never changed, and the
> >>>following
> >>>log messages were kept outputting.
> >>>
> >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
> >>>sby2 LOG: replication terminated by primary server
> >>>sby2 DETAIL: End of WAL reached on timeline 1
> >>>....
> >>
> >>That's after the commit or before? Because in passing I think I
> >>noticed/fixed a bug that could cause exactly that problem...
> >
> >I think I broke that with the "teach pg_receivexlog to cross timelines"
> >patch. Will take a look...
>
> Ok, there was a couple of issues. First, as I suspected above, I added a new
> result set after copy has ended in START_STREAMING command, but forgot to
> teach walreceiver about it. Fixed that.
>
> Secondly, there's an interesting interaction between the new xlogreader code
> and streaming replication and timeline switches:
>
> The first call to the page_read callback in xlogreader is always made with
> size SizeOfXLogRecord, counting from the beginning of the page. That's
> bogus; there can be no WAL record in the very beginning of page, because of
> the page header, so I think that was supposed to be SizeXLogShortPHD. But
> that won't fix the issue.

Yea, thats clearly a typo.

> The problem is that XLogPageRead callback uses the page address and
> requested length to decide what timeline to stream from. For example,
> imagine that there's a timeline switch at offset 1000 in a page, and we have
> already read all the WAL up to that point. When xlogreader first asks to
> fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord),
> XLogPageRead deduces that that byte range is still on the old timeline, and
> starts streaming from the old timeline. Next, xlogreader needs the rest of
> the page, up to 1000 + SizeOfPageHeader, to read the first record it's
> actually interested in, XLogPageRead will return an error because that range
> is not on the timeline that's currently streamed. And we loop back to retry,
> and run into the same problem again.

> This interaction is a bit too subtle for my taste, but the straightforward
> fix is to just modify xlogreader so that the first read_page call requests
> all the bytes up to record we're actually interested in. That seems like a
> smart thing to do anyway.

Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But
ISTM that the code isn't really safe now: Assume targetRecOff is 0
(which is valid) then we read up to SizeOfXLogRecord but assume that we
can read both the xlog record as well as the page header.
Then you also need to factor in that the page header can be differently
big.

And yes, its too subtle :(

Do you want to fix that or shall I?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2013-01-17 22:08:30 Re: Event Triggers: adding information
Previous Message Alvaro Herrera 2013-01-17 21:59:03 Re: ALTER command reworks