
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>,Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>,hlinnakanga(at)awork2(dot)anarazel(dot)de
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
Date: 2013-01-17 13:05:15
Message-ID: 20130117130515.GA19562@awork2.anarazel.de
Lists: pgsql-hackers
On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:
> Hi all,
> 
> There is a strange bug with the latest master head (commit 7fcbf6a).
> When the WAL stream with a master is cut on a slave, slave returns a FATAL
> (well normal...), but then enters in recovery process and automatically
> promotes.
> Here are more details about the logs on slave (I simply killed the master
> manually):
> FATAL:  could not receive data from WAL stream:
> cp: cannot stat
> ‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’: No such
> file or directory
> LOG:  record with zero length at 0/401E1B8
> LOG:  redo done at 0/401E178
> LOG:  last completed transaction was at log time 2013-01-17
> 20:27:53.180971+09
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’:
> No such file or directory
> LOG:  selected new timeline ID: 2
> cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’:
> No such file or directory
> LOG:  archive recovery complete
> DEBUG:  resetting unlogged relations: cleanup 0 init 1
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> DEBUG:  archived transaction log file "000000010000000000000004"
> DEBUG:  archived transaction log file "00000002.history"
> LOG:  statement: create table bn (a int);
> DEBUG:  autovacuum: processing database "postgres"
> 
> Slave does not try anymore to reconnect to master with messages of the type:
> FATAL:  could not connect to the primary server
> 
> I also noticed that there is some delay until modifications on master are
> visible on slave.
> For example run a simple CREATE TABLE and the new table is not
> 
> [some bisecting later...]
> 
> I think that bug has been introduced by commit 7fcbf6a.
> Before splitting xlog reading as a separate facility things worked
> correctly.
> There are also no delay problems before this commit.

OK, my inkling proved to be correct. There are two related issues, plus a minor third:

a) The error handling in XLogReadRecord is inconsistent: it doesn't
always reset the necessary state.

b) ReadRecord: We cannot simply break out of the retry loop in
ReadRecord; just removing the break seems correct.

c) ReadRecord (minor): We should log an error even if errormsg is not
set, otherwise we won't jump out far enough.
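
To make a) and b) concrete, here is a toy model of the control flow, not
the actual xlog.c or xlogreader.c code. Every name in it (ToySource,
toy_read_record, toy_read_with_retry, stale_state) is hypothetical and
made up for illustration. It assumes that a failed read in standby mode
should be retried, while outside standby mode a failed read means end of
WAL. Breaking out of the loop in standby mode would hand the caller the
end-of-WAL result, which matches the promotion seen in the logs above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a WAL source: the first `failures_left` reads fail. */
typedef struct ToySource {
    int failures_left;  /* simulated transient stream failures */
    int record;         /* payload handed back on a successful read */
    int stale_state;    /* reader state that must be cleared after an error */
} ToySource;

/*
 * Hypothetical reader: returns NULL on failure and resets its state on
 * every error path, which is the consistency point a) asks for.
 */
static const int *toy_read_record(ToySource *src)
{
    assert(src != NULL);
    if (src->failures_left > 0) {
        src->failures_left--;
        src->stale_state = 0;   /* always reset before reporting failure */
        return NULL;
    }
    return &src->record;
}

/*
 * Retry loop in the spirit of point b): in standby mode a failed read
 * must not leave the loop, or the caller treats it as end of WAL.
 */
static const int *toy_read_with_retry(ToySource *src, bool standby_mode,
                                      int *attempts)
{
    for (;;) {
        const int *rec = toy_read_record(src);

        (*attempts)++;
        if (rec != NULL)
            return rec;
        if (!standby_mode)
            return NULL;        /* plain recovery: end of WAL reached */
        /* standby: loop again instead of breaking out */
    }
}
```

With two simulated failures and standby_mode set to true, the loop only
returns once a record is available; with standby_mode set to false, the
first failure ends the read.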

I think at least a) and b) are the result of merges between the
development work of different people, sorry for that.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

