Re: Standby corruption after master is restarted

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: emre(at)hasegeli(dot)com
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>, gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-16 13:41:34
Message-ID: be30f068-1876-a1dd-287b-7e6c68e7ed0b@2ndquadrant.com
Lists: pgsql-bugs pgsql-hackers

On 04/16/2018 10:55 AM, Emre Hasegeli wrote:
>> In the initial report (from August 2017) you shared pg_xlogdump output,
>> showing that the corrupted WAL record is an FPI_FOR_HINT right after
>> CHECKPOINT_SHUTDOWN. Was it the same case this time?
>
> No, this time the xlogdump on master looks like this:
>
>> rmgr: Heap2 len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
>> rmgr: Heap2 len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
>> rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc: CHECKPOINT_SHUTDOWN redo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset 7985; oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
>> rmgr: Heap len (rec/tot): 56/ 2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT off 93, blkref #0: rel 1663/40796/40826 blk 0 FPW
>> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT 2018-03-28 14:01:36.800811 UTC
>> rmgr: Heap len (rec/tot): 56/ 1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT off 41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
>> rmgr: Btree len (rec/tot): 55/ 1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc: INSERT_LEAF off 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
>> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT 2018-03-28 14:01:37.361010 UTC
>

OK, and on the standby? I suppose it breaks right after the
CHECKPOINT_SHUTDOWN record, right?
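
If you still have the segment on the standby, pg_xlogdump should complain
at the first record it fails to validate, which would give us the exact
LSN where the standby's copy goes bad. Something like this should do it
(the pg_xlog path is just an example, and the segment lookup assumes
timeline 5 with the default 16MB segment size):

    # dump the records around the shutdown checkpoint from the standby's copy
    pg_xlogdump -p /var/lib/postgresql/9.5/main/pg_xlog \
        -t 5 -s 1DFB/D6BDDF20 -e 1DFB/D6BDF6B0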

>> BTW which versions are we talking about? I see the initial report
>> mentioned catversion 201608131, this one mentions 201510051, so I'm
>> guessing 9.6 and 9.5. Which minor versions?
>
> The first incident was on 9.6.3, the second 9.5.7.
>
>> Is the master under load (accepting writes) before shutdown?
>
> Yes. Especially on the second incident it was being written heavily.
> The master gets up to 10k INSERT transactions per second before being
> killed.
>
>> How was it restarted, actually? I see you're mentioning OOM killer, so I
>> guess "kill -9". What about the first report - was it the same case, or
>> was it restarted "nicely" using pg_ctl?
>
> On the first incident the master was being restarted as part of
> upgrade from 9.6.3 to 9.6.4. It should have been nicely done by the
> Debian upgrade scripts.
>

Hmmm. I'm not sure what exactly the Debian upgrade script does (and IIRC
it depends on the packaging). I wonder whether it does a
fast/smart/immediate shutdown.

>> Could the replica receive the WAL in some other way - say, from a WAL
>> archive? What archive/restore commands do you use?
>
> We have repmgr + barman setup:
>
>> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet --contimeout=2 --timeout=2 %p BARMAN_HOST::barman/'
>> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'

And can you determine how the broken WAL got to the standby? Was it
streamed directly from the primary, or restored from the barman archive?
Is it broken in the archive too?
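
One way to check (paths hypothetical): pull the segment back out of
barman using the same restore command, and compare it against the
primary's copy, if the primary still has it:

    # fetch the archived copy of the segment to a scratch location
    barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER \
        0000000500001DFB000000D6 /tmp/seg.archive

    # matching checksums => the archived copy is intact
    md5sum /tmp/seg.archive pg_xlog/0000000500001DFB000000D6

If the checksums match, the archive is fine and the corruption happened
on the way to (or on) the standby.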

One hypothesis is that the standby starts fetching a WAL segment from
the primary, and while it's doing that the segment gets recycled and
overwritten (so the standby receives bogus data somewhere half-way
through). That should not be possible, of course, but perhaps something
is broken right after a restart? A race condition or something like that?
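
If both copies of the segment are still around, a byte-by-byte
comparison would confirm or refute that: with the recycling scenario I'd
expect the standby's file to match the primary's up to some offset and
then continue with the contents of an older, recycled segment. For
example (paths hypothetical):

    # print the offset and values of each differing byte; the first
    # line of output is where the two copies start to diverge
    cmp -l primary/0000000500001DFB000000D6 \
           standby/0000000500001DFB000000D6 | head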

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
