Re: BUG #15346: Replica fails to start after the crash

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15346: Replica fails to start after the crash
Date: 2018-08-29 06:59:16
Lists: pgsql-bugs pgsql-hackers


2018-08-29 8:17 GMT+02:00 Michael Paquier <michael(at)paquier(dot)xyz>:
> Hmm. That does not sound right to me. If the page has a header LSN
> higher than the one replayed, we should not see it as consistency has
> normally been reached. XLogReadBufferExtended() seems to complain in
> your case about a page which should not exist per the information of
> your backtrace. What's the length of relation file at this point? If
> the relation is considered as having less blocks, shouldn't we just
> ignore it if we're trying to delete items on it and return
> InvalidTransactionId? I have to admit that I am not the best specialist
> with this code.

This is what pg_controldata reports:
Latest checkpoint location: AB3/4A1B30A8
Prior checkpoint location: AB3/4A1B30A8
Latest checkpoint's REDO location: AB3/4149B340
Latest checkpoint's REDO WAL file: 0000000500000AB300000041
Latest checkpoint's TimeLineID: 5
Latest checkpoint's PrevTimeLineID: 5
Latest checkpoint's full_page_writes: on

Minimum recovery ending location: AB3/4A1B3118
Min recovery ending loc's timeline: 5
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on

Therefore it reaches consistency at AB3/4A1B3118 and starts accepting
read-only connections. A few moments later it starts replaying
0000000500000AB300000050, where "AB3/50323E78" contains "Btree/DELETE: 182".
This record in pg_xlogdump looks like:
rmgr: Btree  len (rec/tot): 436/547, tx: 518902810, lsn: AB3/50323E78,
prev AB3/50323E20, desc: DELETE 182 items, blkref #0:
rel 1663/18055/212907 blk 72478 FPW

212907 -- btree index:
$ ls -al data/base/18055/212907*
-rw------- 1 akukushkin akukushkin 594812928 Aug 22 07:22 data/base/18055/212907
-rw------- 1 akukushkin akukushkin 163840 Aug 22 07:22

212875 -- the table:
$ ls -al data/base/18055/212875*
-rw------- 1 akukushkin akukushkin 1073741824 Aug 22 07:20
-rw------- 1 akukushkin akukushkin 1014947840 Aug 22 07:21
-rw------- 1 akukushkin akukushkin 532480 Aug 22 07:21
-rw------- 1 akukushkin akukushkin 65536 Aug 22 07:21

As you can see, the table's size is only about 2GB.

> hblkno looks also unsanely high to me if you look at the other blkno
> references you are mentioning upthread.

hblkno is calculated from the data which postgres reads from the index
file data/base/18055/212907, block 72478.
This block contains only 56 index tuples, while at least 182 are
expected, so starting from tuple 57 the tuple pointer looks like:
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Basically it points to the beginning of the page, which doesn't make
sense at all.

Why does block 72478 of the index relfile not meet our expectations
(why does it contain so few tuples)?
The answer is in the page header: the LSN written in the index page
header is AB3/56BF3B68.
That can mean only one thing: before the crash, while postgres was
still working, it managed to apply the WAL stream up to at least
AB3/56BF3B68, which is far ahead of "Minimum recovery ending location:
AB3/4A1B3118".

Alexander Kukushkin
