Re: BUG #15346: Replica fails to start after the crash

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15346: Replica fails to start after the crash
Date: 2018-08-25 07:54:39
Message-ID: CAFh8B=mYGP1J8ueHKU40Dr4bNBUPJj1rTbZ6Wpg4gR30uCRunA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hi,

btree_xlog_delete_get_latestRemovedXid is callled only when the
postgres is already running InHotStandby
And what this function is actually doing, it reads index page from
disk and iterates over IndexTuple records to get corresponding
HeapPages.
The problem is that xlrec->nitems = 182 in my case, but on the index
page there are only 56 tuples and the rest of the page is filled with
zeros:
(gdb) bt
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f07a2be2801 in __GI_abort () at abort.c:79
#2 0x000055938000f586 in errfinish (dummy=0) at elog.c:557
#3 0x0000559380011d95 in elog_finish (elevel=22, fmt=0x55938007c010
"WAL contains references to invalid pages") at elog.c:1378
#4 0x000055937fbd4e8f in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5 0x000055937fbd58e7 in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6 0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
#7 0x000055937fb8eaa1 in btree_xlog_delete (record=0x55938151b438) at
nbtxlog.c:688
#8 0x000055937fb8f870 in btree_redo (record=0x55938151b438) at nbtxlog.c:1018
#9 0x000055937fbc6101 in StartupXLOG () at xlog.c:6967
#10 0x000055937fe2da0f in StartupProcessMain () at startup.c:216
#11 0x000055937fbda28f in AuxiliaryProcessMain (argc=2,
argv=0x7ffe88f10df0) at bootstrap.c:419
#12 0x000055937fe2c957 in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055937fe27482 in PostmasterMain (argc=16,
argv=0x5593814ed690) at postmaster.c:1321
#14 0x000055937fd7152e in main (argc=16, argv=0x5593814ed690) at main.c:228
(gdb) frame 6
#6 0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
599 hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
(gdb) l 587
582 * Loop through the deleted index items to obtain the
TransactionId from
583 * the heap items they point to.
584 */
585 unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
586
587 for (i = 0; i < xlrec->nitems; i++)
588 {
589 /*
590 * Identify the index tuple about to be deleted
591 */
(gdb) l
592 iitemid = PageGetItemId(ipage, unused[i]);
593 itup = (IndexTuple) PageGetItem(ipage, iitemid);
594
595 /*
596 * Locate the heap page that the index tuple points at
597 */
598 hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
599 hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
600 if (!BufferIsValid(hbuffer))
601 {
(gdb) p i
$45 = 53
(gdb) p unused[i]
$46 = 57
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Therefore iitemid points to the beginning of the index page, what
doesn't make sense.

Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!

(gdb) p *(PageHeader)ipage
$49 = {pd_lsn = {xlogid = 2739, xrecoff = 1455373160}, pd_checksum =
3042, pd_flags = 0, pd_lower = 248, pd_upper = 5936, pd_special =
8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp =
0x7f06ed9e9598}

Index page LSN: AB3/56BF3B68
Current LSN: AB3/50323E78

Is there a way to recover from such a situation? Should the postgres
in such case do comparison of LSNs and if the LSN on the page is
higher than the current LSN simply return InvalidTransactionId?
Apparently, if there are no connections open postgres simply is not
running this code and it seems ok.

Regards,
--
Alexander Kukushkin

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2018-08-25 08:03:22 BUG #15352: postgresql FDW error "ERROR: ORDER BY position 0 is not in select list"
Previous Message Amit Kapila 2018-08-25 03:19:10 Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-08-25 09:20:53 Re: Problem while setting the fpw with SIGHUP
Previous Message Bruce Momjian 2018-08-24 22:50:10 Re: Compile failures today