Re: BUG #13822: Slave terminated - WAL contains references to invalid page

From: <Marek(dot)Petr(at)tieto(dot)com>
To: <michael(dot)paquier(at)gmail(dot)com>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13822: Slave terminated - WAL contains references to invalid page
Date: 2015-12-29 15:51:32
Message-ID: aa185a8bc7db4000b76803b81308b9bc@C105S135VM024.eu.tieto.com
Lists: pgsql-bugs

I used a fresh base backup for the slave after both crashes.
I also scanned the archived WAL segments covering several hours before the last crash and found only the following records for the string 71566:

rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/987859D8, prev 187/987859A0, bkp: 0000, desc: visible: rel 1663/16422/17216; blk 71566
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9CC59020, prev 187/9CC58FE8, bkp: 0000, desc: visible: rel 1663/16422/17220; blk 71566
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9E356D98, prev 187/9E356D60, bkp: 0000, desc: visible: rel 1663/16422/23253; blk 71566

Regards
Marek

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Monday, December 28, 2015 4:29 PM
To: Petr Marek <Marek(dot)Petr(at)tieto(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #13822: Slave terminated - WAL contains references to invalid page

On Mon, Dec 28, 2015 at 8:50 PM, <Marek(dot)Petr(at)tieto(dot)com> wrote:
> Tried to use pageinspect module for affected pages from last two occurrences:
>
> 2015-12-15 13:05:39 CET @ WARNING: page 4333275 of relation
> base/16422/17230 is uninitialized
> 2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation
> base/16422/23253 is uninitialized
>
> Following outputs are the same for master and slave:
>
> [results]

Hm, OK.

> Non-default pars:
>
> [params]

There is nothing fishy here.

> Slave was rebuilt and has been running for almost a week now.

Hm. Has this slave replayed the same WAL records as the slave that failed previously, or did it use a fresher base backup? If it is the latter, the problem would have fixed itself for those two relation pages, as they would have been correctly created by the base backup and not by WAL replay. Perhaps it is too late, but would it be possible to scan the WAL segments you have and see whether there is a record showing those pages being initialized? You would need to find a record like this:

[insert|update|multi-insert|hot_update](init) rel %u/%u/%u; tid %u/%u

Here tid is the t_ctid shown in the results you just sent. This record should normally appear before the ones setting the visibility map bit that caused the PANIC. If it does not, there may actually be a bug, since the page would not have been initialized properly first. At least we are sure that the corruption is not coming from the master.
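
The check described above could be scripted roughly as follows. This is only a sketch: it assumes the WAL has already been dumped to text in the same layout as the records quoted earlier in this thread (for example with pg_xlogdump), and the regular expression is a guess at that layout based on those lines and on the record pattern above. The function names are made up for illustration.

```python
import re

# Assumed layout of one dumped WAL record, based on the lines quoted in this
# thread, e.g. "... lsn: 187/9E356D98, ... desc: visible: rel 1663/16422/23253; blk 71566".
# The "(init)" form and the "tid block/offset" target follow the pattern given above.
RECORD_RE = re.compile(
    r"lsn: (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+),.*?"
    r"desc: (?P<op>[a-z_-]+)(?P<init> ?\(init\))?:? "
    r"rel (?P<rel>\d+/\d+/\d+); "
    r"(?:blk (?P<blk>\d+)|tid (?P<tidblk>\d+)/\d+)"
)


def lsn_key(lsn):
    """Turn an LSN such as '187/9E356D98' into a sortable pair of integers."""
    hi, lo = lsn.split("/")
    return (int(hi, 16), int(lo, 16))


def records_for_page(lines, rel, block):
    """Return (lsn, operation, is_init) for every record touching rel/block."""
    hits = []
    for line in lines:
        m = RECORD_RE.search(line)
        if m is None or m.group("rel") != rel:
            continue
        blk = m.group("blk") or m.group("tidblk")
        if int(blk) != block:
            continue
        hits.append((m.group("lsn"), m.group("op"), m.group("init") is not None))
    return hits


def init_precedes_visible(hits):
    """True if some (init) record has a lower LSN than the first 'visible' record."""
    visible = [lsn_key(lsn) for lsn, op, _ in hits if op == "visible"]
    inits = [lsn_key(lsn) for lsn, _, is_init in hits if is_init]
    return bool(visible) and any(i < min(visible) for i in inits)
```

Feeding it the dumped text of the archived segments with rel "1663/16422/23253" and block 71566 would list every record touching that page; if init_precedes_visible() returns False, no initializing record was found ahead of the visibility map record in the segments scanned.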
--
Michael
