Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Failing start-up archive recovery at Standby mode in PG9.2.4
Date: 2013-04-24 08:59:47
Message-ID: 20130424085947.GB4536@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

ello,

On 2013-04-24 17:43:39 +0900, KONDO Mitsumasa wrote:
> Hi,
>
> I find problem about failing start-up achive recovery at Standby mode in PG9.2.4 streaming replication.
> I test same problem in PG9.2.3. But it is not occerd...

> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDTLOG: 00000: restored log file "000000020000000000000013" from archive

I can't read the error message here, but this looks suspicious. The
recovery command seems to be returning success although it prints such
an error message?

> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [* This point is illegal work]> [Standby] 2013-04-22 01:27:25 EDTLOG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION: ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:25 EDTDEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:25 EDTLOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTDEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:25 EDTLOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTFATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:25 EDTLOCATION: XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:25 EDTLOG: 00000: streaming replication successfully connected to primary
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION: libpqrcv_connect, libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:25 EDTFATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
> >
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION: libpqrcv_receive, libpqwalreceiver.c:389
> >
> > [mitsu-ko(at)localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:30 EDTDEBUG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION: ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:30 EDTDEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:30 EDTLOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTDEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:30 EDTLOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTFATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:30 EDTLOCATION: XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:30 EDTLOG: 00000: streaming replication successfully connected to primary
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION: libpqrcv_connect, libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:30 EDTFATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed

And this seems to be the second problem. You probably need to configure
a higher wal_keep_segments on the primary or you need to fix your
recovery_command.

> We try to fix this problem with my company's colleaguea,
> but it is too difficult and complicated to WAL contorol and validation method...
>
> This problem is occerd in xlog.c:10692.
> > 10692 if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> > 10693 goto next_record_is_invalid;
>
> And problem is occerd this checking in ValidXLOGHeader().
> > 4390 if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
> > 4391 {
> > 4392 ereport(emode_for_corrupt_record(emode, recaddr),
> > 4393 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
> > 4394 hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
> > 4395 readId, readSeg, readOff)));
> > 4396 return false;
> > 4397 }
>
> We think hdr->xlp_pageaddr.xrecoff has wrong value. This is very rare cace!
> We cannot analyze why this value is uncorrect!

I think the tests is just noticing that we don't have valid WAL - we are
probably reading old preallocated wal files from before starting the
node as a standby. Normally it will get the correct data from the
primary after that (you can see the START_REPLICATION commands in the
log), but that failed.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2013-04-24 09:24:04 Re: Fast promotion, loose ends
Previous Message Heikki Linnakangas 2013-04-24 08:53:58 Re: Fast promotion, loose ends