Re: Standby corruption after master is restarted

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Cc: gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com
Subject: Re: Standby corruption after master is restarted
Date: 2018-03-28 17:50:16
Message-ID: CAE2gYzwUUW+_o8hSQuUkpgJj=OkerBghuKXUe4ZE12dJWA2vaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

We experienced this issue again, this time on production. The primary
instance was in a loop of being killed by Linux OOM-killer and being
restarted in 1 minute intervals. The corruption only happened on one
of the two standbys. The primary and the other standby have no
problems. Only the was killed and restarted, the standbys were not.
There weren't any unusual settings, "fsync" was not disabled. Here is
the information I collected.

The logs at the time standby broke:

> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> 2018-03-28 14:01:24 UTC [3693-68] FATAL: invalid memory alloc request size 1916035072

And from the next try:

> 2018-03-28 14:02:15 UTC [26808-5] LOG: consistent recovery state reached at 1DFB/D6BDDFF8
> 2018-03-28 14:02:15 UTC [26808-6] FATAL: invalid memory alloc request size 191603507

The output of pg_controldata:

> pg_control version number: 942
> Catalog version number: 201510051
> Database system identifier: 6223303827582122631
> Database cluster state: in archive recovery
> pg_control last modified: 2018-03-28T15:25:51 UTC
> Latest checkpoint location: 1DFB/D1E17FD8
> Prior checkpoint location: 1DFB/D1E17FD8
> Latest checkpoint's REDO location: 1DFB/D1E17FD8
> Latest checkpoint's REDO WAL file: 0000000500001DFB000000D1
> Latest checkpoint's TimeLineID: 5
> Latest checkpoint's PrevTimeLineID: 5
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 2/3522301360
> Latest checkpoint's NextOID: 141900
> Latest checkpoint's NextMultiXactId: 3434
> Latest checkpoint's NextMultiOffset: 7985
> Latest checkpoint's oldestXID: 3329390243
> Latest checkpoint's oldestXID's DB: 16385
> Latest checkpoint's oldestActiveXID: 0
> Latest checkpoint's oldestMultiXid: 1
> Latest checkpoint's oldestMulti's DB: 16385
> Latest checkpoint's oldestCommitTsXid:0
> Latest checkpoint's newestCommitTsXid:0
> Time of latest checkpoint: 2018-03-28T13:55:52 UTC
> Fake LSN counter for unlogged rels: 0/1
> Minimum recovery ending location: 1DFB/D6BDDFF8
> Min recovery ending loc's timeline: 5
> Backup start location: 0/0
> Backup end location: 0/0
> End-of-backup record required: no
> wal_level setting: hot_standby
> wal_log_hints setting: on
> max_connections setting: 500
> max_worker_processes setting: 8
> max_prepared_xacts setting: 0
> max_locks_per_xact setting: 64
> track_commit_timestamp setting: off
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Size of a large-object chunk: 2048
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value
> Data page checksum version: 0

Hexdump of the WAL file at the position on standby:

0bddfc0 0002 0003 0004 0005 0006 0007 0008 0009
0bddfd0 000a 000b 000c 000d 000e 000f 0010 0011
0bddfe0 0012 0013 0014 0015 0016 0017 0000 0000
0bddff0 0000 0017 0000 0000 5f32 7234 4b55 7045
0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
0bde030 0765 7375 6111 646e 6f72 6469 370d 312e

Hexdump at the same position on the master:

0bddfc0 0002 0003 0004 0005 0006 0007 0008 0009
0bddfd0 000a 000b 000c 000d 000e 000f 0010 0011
0bddfe0 0012 0013 0014 0015 0016 0017 0000 0000
0bddff0 0000 0017 0000 0000 006a 0000 0000 0000
0bde000 d087 0005 0005 0000 e000 d6bd 1dfb 0000
0bde010 0062 0000 0000 0000 df90 d6bd 1dfb 0000
0bde020 0000 0000 63d3 aa61 50ff dff8 d6bd 1dfb
0bde030 0000 0005 0000 0005 0000 0001 0000 0002

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message David G. Johnston 2018-03-28 18:32:55 Re: Code Igniter 3.16 Error Get Last Insert Id
Previous Message Angga Wirapraditya 2018-03-28 17:47:51 Code Igniter 3.16 Error Get Last Insert Id

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-03-28 17:52:24 Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
Previous Message David Steele 2018-03-28 17:46:02 Re: Reopen logfile on SIGHUP