Standby corruption after master is restarted

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Cc: gurkan(dot)gur(at)innogames(dot)com
Subject: Standby corruption after master is restarted
Date: 2017-08-24 14:50:59
Message-ID: CAE2gYzzVZNsGn=-E6grO4sVQs04J02zNKQofQEO8gu8=qCFR+Q@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

One of our standby servers in a testing environment at InnoGames broke
today. The master and the other standby server are working fine. I
noticed that it happened right after the master had been restarted. It
looks to me like the WAL file is corrupted. I wanted to check whether
it could be a bug. These lines were in the log file:

> 2017-08-24 12:19:51 UTC [20351-2] LOG: replication terminated by primary server
> 2017-08-24 12:19:51 UTC [20351-3] DETAIL: End of WAL reached on timeline 1 at 0/9B9F5FF0.
> 2017-08-24 12:19:51 UTC [20351-4] FATAL: could not send end-of-streaming message to primary: no COPY in progress
> 2017-08-24 12:19:51 UTC [26673-7] FATAL: invalid memory alloc request size 1298694144
> 2017-08-24 12:19:51 UTC [26672-3] LOG: startup process (PID 26673) exited with exit code 1
> 2017-08-24 12:19:51 UTC [26672-4] LOG: terminating any other active server processes

I tried to start it again to get more logs:

> 2017-08-24 14:07:46 UTC [28861-12698] LOCATION: KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12699] DEBUG: 00000: transaction ID wrap limit is 2147484807, limited by database with OID 1
> 2017-08-24 14:07:46 UTC [28861-12700] CONTEXT: xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> 2017-08-24 14:07:46 UTC [28861-12701] LOCATION: SetTransactionIdLimit, varsup.c:347
> 2017-08-24 14:07:46 UTC [28861-12702] DEBUG: 00000: prune KnownAssignedXids to 5869245
> 2017-08-24 14:07:46 UTC [28861-12703] CONTEXT: xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> 2017-08-24 14:07:46 UTC [28861-12704] LOCATION: KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12705] FATAL: XX000: invalid memory alloc request size 1298694144
> 2017-08-24 14:07:46 UTC [28861-12706] LOCATION: palloc_extended, mcxt.c:961

This is the output of pg_controldata:

> pg_control version number: 960
> Catalog version number: 201608131
> Database system identifier: 6423674571818176107
> Database cluster state: in archive recovery
> pg_control last modified: Thu 24 Aug 2017 02:01:42 PM UTC
> Latest checkpoint location: 0/9B9C1188
> Prior checkpoint location: 0/9B9C1188
> Latest checkpoint's REDO location: 0/9B9BEC60
> Latest checkpoint's REDO WAL file: 00000001000000000000009B
> Latest checkpoint's TimeLineID: 1
> Latest checkpoint's PrevTimeLineID: 1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 0:5868410
> Latest checkpoint's NextOID: 28526
> Latest checkpoint's NextMultiXactId: 1
> Latest checkpoint's NextMultiOffset: 0
> Latest checkpoint's oldestXID: 1160
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 5868410
> Latest checkpoint's oldestMultiXid: 1
> Latest checkpoint's oldestMulti's DB: 1
> Latest checkpoint's oldestCommitTsXid:0
> Latest checkpoint's newestCommitTsXid:0
> Time of latest checkpoint: Thu 24 Aug 2017 12:02:56 PM UTC
> Fake LSN counter for unlogged rels: 0/1
> Minimum recovery ending location: 0/9B9E13C0
> Min recovery ending loc's timeline: 1
> Backup start location: 0/0
> Backup end location: 0/0
> End-of-backup record required: no
> wal_level setting: replica
> wal_log_hints setting: on
> max_connections setting: 400
> max_worker_processes setting: 8
> max_prepared_xacts setting: 0
> max_locks_per_xact setting: 64
> track_commit_timestamp setting: off
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Size of a large-object chunk: 2048
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value
> Data page checksum version: 1

I ran pg_xlogdump and it failed at the very same location:

> rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc: CHECKPOINT_SHUTDOWN redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> pg_xlogdump: FATAL: error in WAL record at 0/9B9F5F80: unexpected pageaddr 0/979F6000 in log segment 00000000000000000000009B, offset 10444800

This is the output from the master at the same location:

> rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc: CHECKPOINT_SHUTDOWN redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> rmgr: XLOG len (rec/tot): 49/ 2805, tx: 0, lsn: 0/9B9F5FF0, prev 0/9B9F5F80, desc: FPI_FOR_HINT , blkref #0: rel 1663/16396/16398 blk 0 FPW
> rmgr: Heap len (rec/tot): 54/ 5046, tx: 5869245, lsn: 0/9B9F6B00, prev 0/9B9F5FF0, desc: INSERT off 54, blkref #0: rel 1663/16396/16427 blk 64203 FPW
> rmgr: Btree len (rec/tot): 53/ 7961, tx: 5869245, lsn: 0/9B9F7EB8, prev 0/9B9F6B00, desc: INSERT_LEAF off 281, blkref #0: rel 1663/16396/16446 blk 21754 FPW
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 5869245, lsn: 0/9B9F9DF0, prev 0/9B9F7EB8, desc: COMMIT 2017-08-24 12:19:59.560138 UTC

I still have the database available.
