Re: BUG #13822: Slave terminated - WAL contains references to invalid page

From: <Marek(dot)Petr(at)tieto(dot)com>
To: <andres(at)anarazel(dot)de>
Cc: <michael(dot)paquier(at)gmail(dot)com>, <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13822: Slave terminated - WAL contains references to invalid page
Date: 2015-12-30 20:20:14
Message-ID: 1451506840099.98963@tieto.com

Hi,

I created a new replica (the original one crashed on 2015-12-15 13:05:39 CET) via:

$ pg_basebackup -h <master_IP> -D $PGDATA -R -P -U replication --xlog-method=stream
$ cat $PGDATA/recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=*** host=<master_IP> port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
# service postgresql-9.4 start

This new replica crashed on 2015-12-22 00:25:11 CET.

Regular base backups are taken with:

/usr/pgsql-9.4/bin/pg_basebackup -h <master_IP> -D <local_pg_basebackup_dir> -Ft -z -v -Xf -U postgres
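
For the PITR attempts described further down, that tar backup would have been restored roughly as sketched here; the paths are placeholders, not the ones actually used:

# extract the compressed base backup into an empty data directory (paths hypothetical)
$ mkdir -p $PGDATA && cd $PGDATA
$ tar -xzf /backup/base.tar.gz
# with -Xf the WAL needed to reach backup-end consistency is already inside the tar;
# for PITR beyond that point the archived segments are fetched via restore_command (sketched below)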

Running Master pg_controldata:

pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6225208935473175762
Database cluster state: in production
pg_control last modified: Wed 30 Dec 2015 08:00:02 PM CET
Latest checkpoint location: 191/2E08DC90
Prior checkpoint location: 191/2E0829E0
Latest checkpoint's REDO location: 191/2E08DC58
Latest checkpoint's REDO WAL file: 00000001000001910000002E
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/370154188
Latest checkpoint's NextOID: 4803404
Latest checkpoint's NextMultiXactId: 161565
Latest checkpoint's NextMultiOffset: 410484
Latest checkpoint's oldestXID: 187316348
Latest checkpoint's oldestXID's DB: 16422
Latest checkpoint's oldestActiveXID: 370154188
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16422
Time of latest checkpoint: Wed 30 Dec 2015 08:00:01 PM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 600
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

Running Slave pg_controldata:

pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6225208935473175762
Database cluster state: in archive recovery
pg_control last modified: Wed 30 Dec 2015 08:00:40 PM CET
Latest checkpoint location: 191/2E08DC90
Prior checkpoint location: 191/2E0829E0
Latest checkpoint's REDO location: 191/2E08DC58
Latest checkpoint's REDO WAL file: 00000001000001910000002E
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/370154188
Latest checkpoint's NextOID: 4803404
Latest checkpoint's NextMultiXactId: 161565
Latest checkpoint's NextMultiOffset: 410484
Latest checkpoint's oldestXID: 187316348
Latest checkpoint's oldestXID's DB: 16422
Latest checkpoint's oldestActiveXID: 370154188
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16422
Time of latest checkpoint: Wed 30 Dec 2015 08:00:01 PM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 191/2E093C68
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 600
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

Slave pg_controldata after second PITR attempt:

pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6225208935473175762
Database cluster state: shut down in recovery
pg_control last modified: Wed 30 Dec 2015 08:53:51 PM CET
Latest checkpoint location: 187/9DA56658
Prior checkpoint location: 186/D7000060
Latest checkpoint's REDO location: 187/9D647E30
Latest checkpoint's REDO WAL file: 00000001000001870000009D
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/367008737
Latest checkpoint's NextOID: 4721484
Latest checkpoint's NextMultiXactId: 160851
Latest checkpoint's NextMultiOffset: 409011
Latest checkpoint's oldestXID: 187316348
Latest checkpoint's oldestXID's DB: 16422
Latest checkpoint's oldestActiveXID: 367008737
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16422
Time of latest checkpoint: Tue 22 Dec 2015 12:21:40 AM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 187/9FECB9A0
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 600
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

# cat backup_label.old
START WAL LOCATION: 186/D7000028 (file 0000000100000186000000D7)
CHECKPOINT LOCATION: 186/D7000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2015-12-21 19:00:03 CET
LABEL: pg_basebackup base backup

The slave's pg_controldata shown above ("after second PITR attempt") was taken after the second recovery attempt had finished.
To be honest, I don't understand why this second PITR finished without a PANIC when the previous attempt, from the same base backup and the same archived WALs, failed.
See the log details below:
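
For reference, the recovery.conf driving these PITR runs would have looked roughly like the following (the restore_command path is a placeholder; the target time is the one visible in the logs):

# recovery.conf for the PITR attempts -- a minimal sketch, archive path hypothetical
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2015-12-22 00:30:00+01'
# pause_at_recovery_target defaults to 'true' in 9.4, which matches the
# "recovery has paused" message in the second attempt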

First (crashed) PITR:

2015-12-30 08:31:19 CET @ LOG: database system was interrupted; last known up at 2015-12-21 20:15:02 CET
2015-12-30 08:31:19 CET @ LOG: starting point-in-time recovery to 2015-12-22 00:30:00+01
2015-12-30 08:31:19 CET @ LOG: database system was not properly shut down; automatic recovery in progress
2015-12-30 08:31:19 CET @ LOG: redo starts at 186/D7260540
2015-12-30 08:31:19 CET @ LOG: consistent recovery state reached at 186/D8000000
2015-12-30 08:31:19 CET @ LOG: database system is ready to accept read only connections
2015-12-30 08:31:19 CET @ LOG: restored log file "0000000100000186000000D8" from archive
2015-12-30 08:31:20 CET @ LOG: restored log file "0000000100000186000000D9" from archive
/.../
2015-12-30 08:32:48 CET @ LOG: restored log file "00000001000001870000009D" from archive
2015-12-30 08:32:50 CET @ LOG: restored log file "00000001000001870000009E" from archive
2015-12-30 08:32:51 CET @ WARNING: page 73172 of relation base/16422/23253 is uninitialized
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ PANIC: WAL contains references to invalid pages
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ LOG: startup process (PID 29324) was terminated by signal 6: Aborted
2015-12-30 08:32:51 CET @ LOG: terminating any other active server processes

Unfortunately, I don't have pg_controldata output for this one.

Second PITR attempt:

2015-12-30 20:42:39 CET @ LOG: database system was interrupted; last known up at 2015-12-21 20:15:02 CET
2015-12-30 20:42:39 CET @ LOG: starting point-in-time recovery to 2015-12-22 00:30:00+01
2015-12-30 20:42:39 CET @ LOG: restored log file "0000000100000186000000D7" from archive
2015-12-30 20:42:39 CET @ LOG: redo starts at 186/D7000028
2015-12-30 20:42:39 CET @ LOG: consistent recovery state reached at 186/D7265F80
2015-12-30 20:42:39 CET @ LOG: database system is ready to accept read only connections
2015-12-30 20:42:40 CET @ LOG: restored log file "0000000100000186000000D8" from archive
2015-12-30 20:42:40 CET @ LOG: restored log file "0000000100000186000000D9" from archive
/.../
2015-12-30 20:44:03 CET @ LOG: restored log file "00000001000001870000009D" from archive
2015-12-30 20:44:05 CET @ LOG: restored log file "00000001000001870000009E" from archive
2015-12-30 20:44:08 CET @ LOG: restored log file "00000001000001870000009F" from archive
2015-12-30 20:44:10 CET @ LOG: recovery stopping before commit of transaction 367008747, time 2015-12-22 00:30:01.876776+01
2015-12-30 20:44:10 CET @ LOG: recovery has paused
2015-12-30 20:44:10 CET @ HINT: Execute pg_xlog_replay_resume() to continue.
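
At that point, once the data has been checked, replay can be resumed from psql with the function named in the hint (just noting the next step):

-- resume applying WAL past the recovery-pause point (PostgreSQL 9.4 function name)
SELECT pg_xlog_replay_resume();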

Regards
Marek
________________________________________
From: Andres Freund <andres(at)anarazel(dot)de>
Sent: Wednesday, December 30, 2015 12:49 PM
To: Petr Marek
Cc: michael(dot)paquier(at)gmail(dot)com; pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #13822: Slave terminated - WAL contains references to invalid page

Hi,

On 2015-12-29 15:51:32 +0000, Marek(dot)Petr(at)tieto(dot)com wrote:
> I used fresh base backup for slave after both crashes.
> Also tried to scan archived wals several hours before last crash and found only the following for string 71566:
>
> rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/987859D8, prev 187/987859A0, bkp: 0000, desc: visible: rel 1663/16422/17216; blk 71566
> rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9CC59020, prev 187/9CC58FE8, bkp: 0000, desc: visible: rel 1663/16422/17220; blk 71566
> rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9E356D98, prev 187/9E356D60, bkp: 0000, desc: visible: rel 1663/16422/23253; blk 71566

Could you detail how you take base backups and how you set them up as
new replicas? Master's pg_controldata, the crashed standby's and the
standby's backup_label.old would be helpful.

Greetings,

Andres Freund
