Postgres WAL Recovery Fails... And Then Works...

From: Phil Monroe <phil(at)identified(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Cc: Infrastructure <infrastructure(at)identified(dot)com>, Vladimir Giverts <vlad(at)identified(dot)com>, Tejas <tejas(at)identified(dot)com>, Sasha Kipervarg <sasha(at)identified(dot)com>
Subject: Postgres WAL Recovery Fails... And Then Works...
Date: 2013-01-12 02:32:30
Message-ID: 50F0CB3E.4080708@identified.com
Lists: pgsql-admin

Hi Everyone,

So we had to fail over and take a full base backup to get our slave
database back online, and we ran into an interesting scenario. After
copying the data directory, setting up the recovery.conf, and starting
the slave database, the database crashed while replaying xlogs. On a
second start, the database replayed xlogs farther than it initially
got, but ultimately failed again. After starting the DB a third time,
PostgreSQL replayed even further and caught up to the master to start
streaming replication. Is this common and/or acceptable?

The error messages that caused PostgreSQL to stop replaying xlogs:

ERROR #1:
2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of
relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to
invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150)
was terminated by signal 6: Aborted

ERROR #2:
2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of
relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to
invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831)
was terminated by signal 6: Aborted

Fortunately, these errors only pertain to indexes, which can be rebuilt.
But is this a sign that the data directory on the slave is corrupt?
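If a rebuild does turn out to be necessary, my understanding is that the relfilenode from the log lines (e.g. 22892842, in database OID 16408) can be mapped back to an index name and reindexed. A rough sketch only; "some_index" is a placeholder, not a real object of ours, and REINDEX would have to run on a writable node (e.g. the master), not on the read-only standby:

```sql
-- Find which relation relfilenode 22892842 belongs to
-- (run while connected to the database with OID 16408).
SELECT relname, relkind FROM pg_class WHERE relfilenode = 22892842;

-- Then rebuild it on a writable node; "some_index" is a placeholder
-- for whatever name the query above returns.
REINDEX INDEX some_index;
```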

Here is a full timeline of the logs, highlighting the issues:

1. Data Directory Copy Finishes.
2. Recovery.conf Setup
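For context, the recovery.conf was roughly of this shape. The values below are reconstructed from the log output (e.g. the unpigz lines against /mnt/db/wals), so treat them as an approximation rather than our verbatim config:

```
standby_mode = 'on'
# placeholder connection string, not our actual master
primary_conninfo = 'host=master.example.com port=5432 user=replication'
# approximation of the archive fetch; the logs show unpigz reading
# from /mnt/db/wals, but the exact command line may differ
restore_command = 'unpigz -c /mnt/db/wals/%f.gz > %p'
```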

3. Start #1:

2013-01-12 00:14:23 UTC::@:[61150]: LOG: database system was shut down
in recovery at 2013-01-12 00:14:06 UTC
2013-01-12 00:14:23 UTC:[local]:[unknown](at)[unknown]:[61151]: LOG:
incomplete startup packet
2013-01-12 00:14:23 UTC:[local]:postgres(at)postgres:[61152]: FATAL: the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:14:23 UTC::@:[61150]: LOG: entering standby mode
2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
"0000000900008E45000000B8" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
"0000000900008E450000008B" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG: could not open directory
"pg_snapshots": No such file or directory
2013-01-12 00:14:23 UTC::@:[61150]: LOG: redo starts at 8E45/8B173180
2013-01-12 00:14:23 UTC::@:[61150]: LOG: consistent recovery state
reached at 8E45/8B174840
2013-01-12 00:14:23 UTC:[local]:postgres(at)postgres:[61175]: FATAL: the
database system is starting up
2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
"0000000900008E450000008C" from archive
2013-01-12 00:14:24 UTC::@:[61150]: LOG: restored log file
"0000000900008E450000008D" from archive

*SNIP*

2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file
"0000000900008E4800000066" from archive
2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file
"0000000900008E4800000067" from archive
2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of
relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to
invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150)
was terminated by signal 6: Aborted
2013-01-12 00:17:18 UTC::@:[61121]: LOG: terminating any other active
server processes

4. PostgreSQL shuts down...
5. Debugging logs enabled in postgresql.conf.

6. Start #2:

2013-01-12 00:51:48 UTC::@:[36831]: LOG: database system was
interrupted while in recovery at log time 2013-01-11 18:05:31 UTC
2013-01-12 00:51:48 UTC::@:[36831]: HINT: If this has occurred more
than once some data might be corrupted and you might need to choose an
earlier recovery target.
2013-01-12 00:51:48 UTC:[local]:[unknown](at)[unknown]:[36832]: LOG:
incomplete startup packet
2013-01-12 00:51:48 UTC:[local]:postgres(at)postgres:[36833]: FATAL: the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:51:48 UTC::@:[36831]: LOG: entering standby mode
2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file
"0000000900008E45000000B8" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: checkpoint record is at
8E45/B80AF650
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: redo record is at
8E45/8B173180; shutdown FALSE
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next transaction ID:
0/552803703; next OID: 24427698
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next MultiXactId: 109027;
next MultiXactOffset: 2442921
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: oldest unfrozen transaction
ID: 3104202601, in database 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: transaction ID wrap limit is
956718952, limited by database with OID 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: resetting unlogged
relations: cleanup 1 init 0
2013-01-12 00:51:48 UTC::@:[36831]: LOG: could not open directory
"pg_snapshots": No such file or directory
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: initializing for hot standby
2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file
"0000000900008E450000008B" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: LOG: redo starts at 8E45/8B173180
2013-01-12 00:51:49 UTC::@:[36831]: LOG: restored log file
"0000000900008E450000008C" from archive

*SNIP*

2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000062" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000063" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000064" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000065" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000066" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000067" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG: consistent recovery state
reached at 8E48/67AC4E28
2013-01-12 00:54:46 UTC::@:[36796]: LOG: database system is ready to
accept read only connections
2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000068" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
"0000000900008E4800000069" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
"0000000900008E480000006A" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
"0000000900008E480000006B" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
"0000000900008E480000006C" from archive

*SNIP*

2013-01-12 01:02:16 UTC::@:[36831]: LOG: restored log file
"0000000900008E4F00000079" from archive
2013-01-12 01:02:16 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: LOG: restored log file
"0000000900008E4F0000007A" from archive
2013-01-12 01:02:17 UTC::@:[36831]: DEBUG: got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of
relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to
invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831)
was terminated by signal 6: Aborted
2013-01-12 01:02:20 UTC::@:[36796]: LOG: terminating any other active
server processes

7. PostgreSQL shuts down...

8. Start #3:

2013-01-12 01:04:04 UTC::@:[61742]: LOG: database system was
interrupted while in recovery at log time 2013-01-11 19:50:31 UTC
2013-01-12 01:04:04 UTC::@:[61742]: HINT: If this has occurred more
than once some data might be corrupted and you might need to choose an
earlier recovery target.
2013-01-12 01:04:04 UTC:[local]:[unknown](at)[unknown]:[61743]: LOG:
incomplete startup packet
2013-01-12 01:04:04 UTC:[local]:postgres(at)postgres:[61744]: FATAL: the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 01:04:04 UTC::@:[61742]: LOG: entering standby mode
2013-01-12 01:04:04 UTC::@:[61742]: LOG: restored log file
"0000000900008E4A00000039" from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: checkpoint record is at
8E4A/39CD4BA0
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: redo record is at
8E4A/19F0D210; shutdown FALSE
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next transaction ID:
0/552859005; next OID: 24427698
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next MultiXactId: 109221;
next MultiXactOffset: 2443321
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: oldest unfrozen transaction
ID: 3104202601, in database 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: transaction ID wrap limit is
956718952, limited by database with OID 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: resetting unlogged
relations: cleanup 1 init 0
2013-01-12 01:04:04 UTC::@:[61742]: LOG: could not open directory
"pg_snapshots": No such file or directory
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: initializing for hot standby
2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file
"0000000900008E4A00000019" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:04:05 UTC::@:[61742]: LOG: redo starts at 8E4A/19F0D210
2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file
"0000000900008E4A0000001A" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive

*SNIP*

2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F00000077" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F00000078" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F00000079" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F0000007A" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG: consistent recovery state
reached at 8E4F/7A22BD08
2013-01-12 01:09:38 UTC::@:[61729]: LOG: database system is ready to
accept read only connections
2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F0000007B" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F0000007C" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F0000007D" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:09:39 UTC::@:[61742]: LOG: restored log file
"0000000900008E4F0000007E" from archive

*SNIP*

2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
"0000000900008E53000000D9" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
"0000000900008E53000000DA" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
"0000000900008E53000000DB" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file
"0000000900008E53000000DC" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file
"0000000900008E53000000DD" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[61742]: LOG: unexpected pageaddr
8E4A/1000000 in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[19504]: LOG: streaming replication
successfully connected to primary
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=1
file=base/16408/22873432 time=2.538 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=2
file=base/16408/18967506 time=12.054 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=3
file=base/16408/18967506_fsm time=0.095 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=4
file=base/16408/22873244 time=0.144 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=5
file=base/16408/22892823 time=0.087 msec

9. The slave DB connects to the master via streaming replication and all
seems fine.
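As an aside for anyone following the segment names above: a quick way to sanity-check where each restored file sits in the WAL sequence is to decode the 24-hex-digit name into timeline, log, and segment (this assumes the default 16 MB segment size):

```python
# Decode a 9.x-era WAL segment file name (24 hex digits) into its
# timeline, log id, and segment number, plus the starting LSN of the
# segment in the "log/offset" form the server logs use.
def decode_wal_name(name):
    assert len(name) == 24
    tli = int(name[0:8], 16)    # timeline
    log = int(name[8:16], 16)   # high 32 bits of the WAL location
    seg = int(name[16:24], 16)  # segment within that log
    # default segment size is 16 MB
    lsn = "%X/%X" % (log, seg * 16 * 1024 * 1024)
    return tli, log, seg, lsn

print(decode_wal_name("0000000900008E4800000067"))
# → (9, 36424, 103, '8E48/67000000')
```

For example, the file 0000000900008E53000000DE from the "unexpected pageaddr" step decodes to log 36435, segment 222, which matches the server's own "log file 36435, segment 222" message.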

Any help would be appreciated. Thanks!

--

Phil Monroe

Stuff | Identified

185 Berry Street Suite 6100, San Francisco, CA 94107

* m: 614 832 9495
* e: phil(at)identified(dot)com
* w: identified.com <http://www.identified.com/>

* tw: @philmonroe <https://twitter.com/#%21/philmonroe>
* e: phil(at)philmonroe(dot)com
* w: philmonroe.com <http://philmonroe.com>
