Re: Standby corruption after master is restarted

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: emre(at)hasegeli(dot)com
Cc: tomas(dot)vondra(at)2ndquadrant(dot)com, pgsql-bugs(at)postgresql(dot)org, gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-26 10:53:04
Message-ID: 20180426.195304.118373455.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello. I found out how this occurs.

# added -hackers

I've seen a similar case caused by inadequate operation, but this
time I found that it can happen under normal operation.

At Tue, 17 Apr 2018 17:14:05 +0200, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote in <CAE2gYzwB0+MM3L6-zdK0JKO_aXqHNCKU7VAiDJk_kKRRJ4B=yA(at)mail(dot)gmail(dot)com>
> > OK, this seems to confirm the theory that there's a race condition between
> > segment recycling and replicating. It's likely limited to short period after
> > a crash, otherwise we'd probably see many more reports.
>
> I am suspicious about this part. It cannot be happening 2 times to us
> and never to anybody else. Maybe people do not report it because it
> is easy to deal with the problem. You just delete the corrupted WAL
> file. Or maybe there is something special in our infrastructure.
>
> Thank you for your help so far.
>
> >> This is the connection information. Although the master shows SSL
> >> compression is disabled in despite of being explicitly asked for.
>
> > Hmmm, that seems like a separate issue. When you say 'master shows SSL
> > compression is disabled' where do you see that?
>
> I thought it might be related first, so I mentioned. Then I found out
> on pg_stat_ssl view on the server that it is actually disabled.
> Probably the configuration of OpenSSL is disabling it.

A segment is not cleared on recycling. walreceiver writes WAL
record by record, so the startup process can see an arbitrary
byte sequence after the last valid record when the replication
connection is lost or the standby is restarted.
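
To illustrate the mechanism with a standalone sketch (this is not
PostgreSQL code, just a demonstration of the file-level effect):
overwriting only the beginning of an existing file leaves the old
bytes readable after the write position, exactly like the stale
record fragments a standby can read past the last record
walreceiver has written into a recycled segment.

  /*
   * Standalone illustration: partially overwriting an existing file
   * leaves the previous contents intact beyond the write position.
   */
  #include <stdio.h>

  int
  main(void)
  {
      char        buf[32];
      size_t      n;
      FILE       *f;

      /* "old" contents, standing in for a recycled WAL segment */
      f = fopen("recycled_segment", "w");
      fputs("OLDRECORDOLDRECORDOLDRECORD", f);
      fclose(f);

      /* overwrite only the beginning, as walreceiver does record by record */
      f = fopen("recycled_segment", "r+");
      fputs("NEWREC", f);
      fclose(f);

      f = fopen("recycled_segment", "r");
      n = fread(buf, 1, sizeof(buf) - 1, f);
      buf[n] = '\0';
      fclose(f);

      /* prints NEWRECORDOLDRECORDOLDRECORD: stale bytes follow the new data */
      printf("%s\n", buf);
      return 0;
  }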

The following scenario results in a similar situation.

1. Create a master and a standby and start them.

This happens more easily if wal_keep_segments is set large
(20 or so) on the master and 0 on the standby.

2. Write WAL until segment recycling happens on the standby.
Explicit checkpoints on the standby make it faster. Several
rounds may be needed before a recycled segment appears on the
standby.

maybe_loop {
  master:
    create table t (a int);
    insert into t (select a from generate_series(0, 150000) a);
    delete from t;
    checkpoint;

  standby:
    checkpoint;
    <check for recycled segments in pg_wal>
}

3. Stop the master.

4. The standby starts to complain that the master is missing.

At this point the standby complains about several kinds of
failure. I saw 'invalid record length' and 'incorrect prev-link'
this time, and 'invalid resource manager ID' when mixing records
of different sizes. If XLogReadRecord sees a record with an
impossibly large xl_tot_len there, it causes a palloc failure and
the startup process dies.

5. If you see 'zero length record', it is nothing interesting.
Repeat steps 3 and 4 to see other failures.

This doesn't seem to happen on the master side since XLogWrite
writes in units of pages, so after a crash redo always sees a
zero-length record, a wrong magic number, or a wrong pageaddr at
a page boundary. On the other hand, walreceiver writes in units
of records, so redo can see an arbitrary byte sequence as
xl_tot_len, xl_prev and xl_rmid while fetching the next record.
Especially the first of them can cause a FATAL with "invalid
memory alloc".
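
For reference, the fixed-size record header looks like this
(roughly as declared in access/xlogrecord.h; comments mine). The
point is that xl_tot_len has to be trusted to size the record
buffer before the whole record can be read, so the CRC cannot
reject a garbage value first.

  typedef struct XLogRecord
  {
      uint32      xl_tot_len; /* total length of the record; a garbage
                               * value here becomes the allocation size
                               * requested by allocate_recordbuf() */
      TransactionId xl_xid;   /* xact id */
      XLogRecPtr  xl_prev;    /* pointer to the previous record; checked
                               * against the expected position */
      uint8       xl_info;    /* flag bits */
      RmgrId      xl_rmid;    /* resource manager id; checked against the
                               * known rmgr range */
      /* 2 bytes of padding here */
      pg_crc32c   xl_crc;     /* CRC of the record; verifiable only after
                               * the whole record has been read into the
                               * buffer sized from xl_tot_len */
  } XLogRecord;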

I think this behavior is a bug. XLogReadRecord considers the
case, but palloc_extended() breaks it. So the attached patch adds
a new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and
allocate_recordbuf() calls it with the new flag. That alone fixes
the problem. In addition, the patch frees the reader's record
buffer when facing erroneous records, since such records can
leave a too-large buffer allocated.
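
In outline the approach is as follows (a sketch of the idea, not
the attached patch verbatim): allocate_recordbuf() asks
palloc_extended() not to raise an ERROR for an over-large request
and reports failure to XLogReadRecord, whose existing failure
path already treats this as bogus data and logs "record length
... too long".

  static bool
  allocate_recordbuf(XLogReaderState *state, uint32 reclength)
  {
      uint32      newSize = reclength;

      newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
      newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

      if (state->readRecordBuf)
          pfree(state->readRecordBuf);

      /*
       * MCXT_ALLOC_NO_PARAMERR is the flag proposed here: in addition to
       * MCXT_ALLOC_NO_OOM it makes palloc_extended() return NULL instead
       * of ERRORing out when the request size is invalid, e.g. a bogus
       * xl_tot_len far beyond MaxAllocSize.
       */
      state->readRecordBuf = (char *)
          palloc_extended(newSize, MCXT_ALLOC_NO_OOM | MCXT_ALLOC_NO_PARAMERR);
      if (state->readRecordBuf == NULL)
      {
          state->readRecordBufSize = 0;
          return false;       /* caller reports the record as too long */
      }

      state->readRecordBufSize = newSize;
      return true;
  }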

After applying this patch, the behavior in this situation changes
as follows.

> LOG: started streaming WAL from primary at 0/83000000 on timeline 1
> LOG: record length 1298694144 at 0/83C17B70 too long
> FATAL: terminating walreceiver process due to administrator command
> LOG: record length 1298694144 at 0/83C17B70 too long
> LOG: record length 1298694144 at 0/83C17B70 too long
> LOG: received promote request
> LOG: redo done at 0/83C17B38
> LOG: last completed transaction was at log time 2018-04-26 19:10:12.360253+09
> LOG: selected new timeline ID: 2
> LOG: archive recovery complete
> LOG: database system is ready to accept connections

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment: prevent_standby_from_crash_on_bogus_totlen.patch (text/x-patch, 2.9 KB)
