Re: Standby corruption after master is restarted

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>, gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-16 08:55:13
Message-ID: CAE2gYzxnF1wwUg_nysXLKg9CMjz7SHCDdv=7xOShaA-_=uLT+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

> In the initial report (from August 2018) you shared pg_xlogdump output,
> showing that the corrupted WAL record is an FPI_FOR_HINT right after
> CHECKPOINT_SHUTDOWN. Was it the same case this time?

No, this time the xlogdump on master looks like this:

> rmgr: Heap2 len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
> rmgr: Heap2 len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
> rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc: CHECKPOINT_SHUTDOWN redo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset 7985; oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> rmgr: Heap len (rec/tot): 56/ 2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT off 93, blkref #0: rel 1663/40796/40826 blk 0 FPW
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT 2018-03-28 14:01:36.800811 UTC
> rmgr: Heap len (rec/tot): 56/ 1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT off 41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
> rmgr: Btree len (rec/tot): 55/ 1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc: INSERT_LEAF off 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT 2018-03-28 14:01:37.361010 UTC

> BTW which versions are we talking about? I see the initial report
> mentioned catversion 201608131, this one mentions 201510051, so I'm
> guessing 9.6 and 9.5. Which minor versions?

The first incident was on 9.6.3, the second 9.5.7.

> Is the master under load (accepting writes) before shutdown?

Yes. Especially on the second incident it was being written heavily.
The master gets up to 10k INSERT transactions per second before being
killed.

> How was it restarted, actually? I see you're mentioning OOM killer, so I
> guess "kill -9". What about the first report - was it the same case, or
> was it restarted "nicely" using pg_ctl?

On the first incident the master was being restarted as part of
upgrade from 9.6.3 to 9.6.4. It should have been nicely done by the
Debian upgrade scripts.

> Could the replica receive the WAL in some other way - say, from a WAL
> archive? What archive/restore commands you use?

We have repmgr + barman setup:

> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet --contimeout=2 --timeout=2 %p BARMAN_HOST::barman/'
> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Emre Hasegeli 2018-04-16 08:58:35 Re: Standby corruption after master is restarted
Previous Message Aman Bhadoriya 2018-04-16 06:12:31 installation error

Browse pgsql-hackers by date

  From Date Subject
Next Message Emre Hasegeli 2018-04-16 08:58:35 Re: Standby corruption after master is restarted
Previous Message Devrim Gündüz 2018-04-16 08:49:31 Re: submake-errcodes