BUG #8686: Standby could not restart.

From: katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #8686: Standby could not restart.
Date: 2013-12-19 02:57:26
Message-ID: E1VtTni-00082E-Jv@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 8686
Logged by: Tomonari Katsumata
Email address: katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp
PostgreSQL version: 9.2.6
Operating system: Red Hat Enterprise Linux 6.2 x86_64
Description:

Hi,

I'm testing whether a standby could restart with asynchronous replication.

The testcase is very simple like below.
(A) start asyncronous replication
(B) stop standby with "-m f"
(C) start standby

I tried (B) and (C) periodically.

In almost cases, it worked fine.
But standby could not start one time because of PANIC.
---- Log -----
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: database system was shut down
in recovery at 2013-12-09 10:42:29 JST
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6273
cp: cannot stat `../arc/00000002.history': No such file or directory
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: entering standby mode
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6359
cp: cannot stat `../arc/000000010000000100000008': No such file or
directory
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: consistent recovery state
reached at 1/8E7F110
[Standby] 2013-12-09 10:42:30 JST LOCATION: CheckRecoveryConsistency,
xlog.c:7366
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: restored log file
"000000010000000100000007" from archive
[Standby] 2013-12-09 10:42:30 JST LOCATION: RestoreArchivedFile,
xlog.c:3273
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: redo starts at 1/783B230
[Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6827
[Standby] 2013-12-09 10:42:30 JST WARNING: 01000: page 1365 of relation
base/16384/16388 does not exist
[Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel
1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION: report_invalid_page,
xlogutils.c:66
[Standby] 2013-12-09 10:42:30 JST PANIC: XX000: WAL contains references to
invalid pages
[Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel
1663/16384/16388; tid 1365/152; new 1365/153
[Standby] 2013-12-09 10:42:30 JST LOCATION: log_invalid_page,
xlogutils.c:91
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: startup process (PID 12560)
was terminated by signal 6: Aborted
[Standby] 2013-12-09 10:42:30 JST LOCATION: LogChildExit,
postmaster.c:2895
[Standby] 2013-12-09 10:42:30 JST LOG: 00000: terminating any other active
server processes
[Standby] 2013-12-09 10:42:30 JST LOCATION: HandleChildCrash,
postmaster.c:2682
[Standby] 2013-12-09 11:10:12 JST LOG: 00000: database system was
interrupted while in recovery at log time 2013-12-09 10:32:14 JST
[Standby] 2013-12-09 11:10:12 JST HINT: If this has occurred more than once
some data might be corrupted and you might need to choose an earlier
recovery target.
[Standby] 2013-12-09 11:10:12 JST LOCATION: StartupXLOG, xlog.c:6289
---------

I tried to fix this problem.
At first, I doubted the recovery state reached "consistent" before redo
starts.
And then I checked pg_control and related WAL.
The WAL sequence is like below.

WAL--(a)--(b)--(c)--(d)--(e)-->
================================================
(a) Latest checkpoint's REDO location
1/783B230

(b) hot_update
1/7842010

(c) truncate
1/8E7E5C8

(d) Latest checkpoint location
1/8E7F0B0

(e) Minimum recovery ending location
1/8E7F110
================================================

>From these things, I found it has happened with this scenario.
----------
(1) standby starting
(2) seeking checkpoint location 1/8E7F0B0 because backup_label is not
absecnt
(3) reachedConsistency is set to true at 1/8E7F110 in
CheckRecoveryConsistent
(4) redo start from 1/783B230
(5) PANIC at 1/7842010 because reachedConsistency has set already and
operating against a block which will be truncated at (c).
----------

At step(2), EndRecPtr is set to 1/8E7F110(next to 1/8E7F0B0),
so reachedConsistency is set to true at step(3).

I think it's not need to increase EndRecPtr while seeking checkpoint
location.
I tried to revise it and this worked fine.

I think this is very very narrow gate, but it could happen.

I'm not sure this problem happen with synchronous replication or
another version of PostgreSQL(ie. 9.3/9.1/9.0),
but at least we need to fix it.

regards,
----------------
Tomonari Katsumata

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tomonari Katsumata 2013-12-19 03:01:43 Re: BUG #8686: Standby could not restart.
Previous Message Alvaro Herrera 2013-12-19 00:58:08 Re: Update with subselect sometimes returns wrong result