Re: BUG #15346: Replica fails to start after the crash

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15346: Replica fails to start after the crash
Date: 2018-08-23 08:01:08
Message-ID: CAFh8B==4gpwor6ODercru0iifqwd0nQrdRvXzmorWkMRQa4q=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hi Michae,

It aborts not right after reaching a consistent state, but a little bit later:
"consistent recovery state reached at AB3/4A1B3118"
"WAL contains references to invalid pages",,,,,"xlog redo at
AB3/50323E78 for Btree/DELETE: 182 items"

Wal segments 0000000500000AB30000004A and 0000000500000AB300000050
correspondingly.

Also I managed to get a backtrace:

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fd3f1deb801 in __GI_abort () at abort.c:79
#2 0x00005644f7094aaf in errfinish (dummy=dummy(at)entry=0) at
./build/../src/backend/utils/error/elog.c:557
#3 0x00005644f7096a49 in elog_finish (elevel=elevel(at)entry=22,
fmt=fmt(at)entry=0x5644f70e5998 "WAL contains references to invalid
pages") at ./build/../src/backend/utils/error/elog.c:1378
#4 0x00005644f6d8ec18 in log_invalid_page (node=...,
forkno=forkno(at)entry=MAIN_FORKNUM, blkno=blkno(at)entry=179503104,
present=<optimized out>) at
./build/../src/backend/access/transam/xlogutils.c:95
#5 0x00005644f6d8f168 in XLogReadBufferExtended (rnode=...,
forknum=forknum(at)entry=MAIN_FORKNUM, blkno=179503104,
mode=mode(at)entry=RBM_NORMAL) at
./build/../src/backend/access/transam/xlogutils.c:515
#6 0x00005644f6d59448 in btree_xlog_delete_get_latestRemovedXid
(record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:593
#7 btree_xlog_delete (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:682
#8 btree_redo (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:1012
#9 0x00005644f6d83c34 in StartupXLOG () at
./build/../src/backend/access/transam/xlog.c:6967
#10 0x00005644f6f27953 in StartupProcessMain () at
./build/../src/backend/postmaster/startup.c:216
#11 0x00005644f6d925a7 in AuxiliaryProcessMain (argc=argc(at)entry=2,
argv=argv(at)entry=0x7fff1f893a80) at
./build/../src/backend/bootstrap/bootstrap.c:419
#12 0x00005644f6f24983 in StartChildProcess (type=StartupProcess) at
./build/../src/backend/postmaster/postmaster.c:5300
#13 0x00005644f6f27437 in PostmasterMain (argc=16,
argv=0x5644f7d07f10) at
./build/../src/backend/postmaster/postmaster.c:1321
#14 0x00005644f6d08b5b in main (argc=16, argv=0x5644f7d07f10) at
./build/../src/backend/main/main.c:228

From looking on xlogutils.c I actually figured out that this backtrace
is a bit misleading because it can't get to the line 515
So I checked out REL9_6_10 tag and build it with -O0, now it makes more sense.
(gdb) bt
#0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f9ae0777801 in __GI_abort () at abort.c:79
#2 0x000055c9f285f56d in errfinish (dummy=0) at elog.c:557
#3 0x000055c9f2861d7c in elog_finish (elevel=22, fmt=0x55c9f28cbff0
"WAL contains references to invalid pages") at elog.c:1378
#4 0x000055c9f2424e76 in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5 0x000055c9f24258ce in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6 0x000055c9f23de8ce in btree_xlog_delete_get_latestRemovedXid
(record=0x55c9f3cbf438) at nbtxlog.c:593
#7 0x000055c9f23dea88 in btree_xlog_delete (record=0x55c9f3cbf438) at
nbtxlog.c:682
#8 0x000055c9f23df857 in btree_redo (record=0x55c9f3cbf438) at nbtxlog.c:1012
#9 0x000055c9f24160e8 in StartupXLOG () at xlog.c:6967
#10 0x000055c9f267d9f6 in StartupProcessMain () at startup.c:216
#11 0x000055c9f242a276 in AuxiliaryProcessMain (argc=2,
argv=0x7ffea746c260) at bootstrap.c:419
#12 0x000055c9f267c93e in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055c9f2677469 in PostmasterMain (argc=16,
argv=0x55c9f3c91690) at postmaster.c:1321
#14 0x000055c9f25c1515 in main (argc=16, argv=0x55c9f3c91690) at main.c:228

> After close to an hour of repeated tests, I am not able to reproduce
> that. Maybe there are some specifics in your schema causing more
> certain types of records to be created. Could you check if in the set
> of WAL segments replayed you have references to the block the hash table
> is complaining about?

I am sorry, but don't really know where to look :(

Regards,
--
Alexander Kukushkin

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2018-08-23 09:32:54 BUG #15348: Postgres 8.4 accepte les connexions entrantes
Previous Message Michael Paquier 2018-08-23 05:36:32 Re: BUG #15347: Unaccent for greek characters does not work

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-08-23 08:12:33 Re: [PATCH] Add regress test for pg_read_all_stats role
Previous Message Andreas Joseph Krogh 2018-08-23 07:14:56 Re: Query is over 2x slower with jit=on