Re: BUG #15412: "invalid contrecord length" during WAL replica recovery

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(at)paquier(dot)xyz
Cc: timur(dot)luchkin(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org, hlinnaka(at)iki(dot)fi
Subject: Re: BUG #15412: "invalid contrecord length" during WAL replica recovery
Date: 2018-10-12 07:20:29
Message-ID: 20181012.162029.59628939.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs

Hello.

At Mon, 1 Oct 2018 18:06:46 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20181001090646(dot)GM11712(at)paquier(dot)xyz>
> On Mon, Oct 01, 2018 at 08:38:23AM +0000, PG Bug reporting form wrote:
> > Sorry to post it again, but I really need help to recover broken replica.
> > LOG: invalid contrecord length 861 at 159E/A6FFFC40
>
> Heikki, Horiguchi-san, couldn't this be a side effect of ca572db22?
> I am afraid that this is not the first report we have on the matter
> lately.

First, I'd say with confidence that this is not related to the patch.

The patch allows a contrecord in the next segment to be fetched
from any available source, but only *after finding it is
missing*. The server in trouble here is fetching segments from
the WAL archive continuously. I take "offsite WAL replica" to
mean "a server that is not part of the main site cluster and is
recovering from its own archive files, which are continuously fed
from (maybe) the master in the main site".

> <2018-09-25 08:07:23 UTC--- [app:,pid:19517,00000]>LOG: invalid contrecord length 861 at 159E/A6FFFC40

The page in A7 that holds the continuation of the record
disagrees about the number of remaining bytes. I suspect that A7
was copied while only halfway written (and the archive file should
be overwritten after master restart), even though I'm not sure how
a halfway-written file leads to this failure.

I'd check the consistency of the A7 file on the offsite replica
against the source (the master or a replica in the main site),
using md5 or something similar. If they don't match, re-copying
A7 into the offsite archive directory will fix the problem.
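
For what it's worth, the comparison could be sketched in Python
roughly as follows. This is only an illustration under
assumptions: the function names are made up here, and both copies
of the segment are assumed to be readable from one host. If they
are not, run file_md5 on each side and compare the two digests by
hand.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a whole WAL segment is never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def segments_match(source_path, replica_path):
    """Return True if the two files have the same size and MD5 digest.

    Both arguments are hypothetical paths to two copies of the same
    WAL segment file (for example the source copy and the offsite
    archive copy).
    """
    source, replica = Path(source_path), Path(replica_path)
    # A size mismatch already shows the copies differ; skip hashing.
    if source.stat().st_size != replica.stat().st_size:
        return False
    return file_md5(source) == file_md5(replica)
```

A False result for the A7 segment would mean the offsite copy
differs from the source and is a candidate for re-copying.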

Thoughts?

--
Kyotaro Horiguchi
NTT Open Source Software Center
