Re: 'Invalid lp' during heap_xlog_delete

From: Daniel Wood <hexexpert(at)comcast(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 'Invalid lp' during heap_xlog_delete
Date: 2019-11-13 02:23:33
Message-ID: 1335373813.287510.1573611814107@connect.xfinity.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

It's been tedious to get it exactly right but I think I got it. FYI, I was delayed because today we had yet another customer hit this: 'redo max offset' error. The system crashed as a number of autovacuums and a checkpoint happened and then the REDO failure.

Two tiny code changes:
bufmgr.c:bufferSync() pg_usleep(10000000); // At begin of function

smgr.c:smgrtruncate(): Add the following just after CacheInvalidateSmgr()
if (forknum == MAIN_FORKNUM && nblocks == 0) {
pg_usleep(40000000);
{ char *cp=NULL; *cp=13; }
}

Now for the heavily commented SQL repro. It will require that you execute a checkpoint in another session when instructed by the repro.sql script. You have 4 seconds to do that. The repro script explains exactly what must happen.

-----------------------------------------------------------
create table t (c char(1111));
alter table t alter column c set storage plain;
-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO Delete would ignore a non-existant page: XLogReadBufferForRedoExtended: return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty page.
insert into t values ('1'), ('2');
delete from t;
checkpoint; -- Checkpoint the empty page to disk.
-- This insert should be before the next checkpoint 'start'. I don't want to replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');
-- Checkpoint needs to start in another session. However, I need to stall the checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum below.
select 'Please start checkpoint in another session';
select pg_sleep(4);
-- Below is the problematic delete.
-- It succeeds now(online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
-- the deletes. This leads to BLK_NEEDS_REDO for the deletes. It works
-- because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
-- In that case BLK_DONE will happen and there'll be no problem. LSN check.
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
-- smgrtruncate->DropRelFileNodeBuffers() which invalidates the dirty page.
-- If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
-- happens and we skip the redo of the delete. So the skipped dirty write is OK.
-- The problme happens if we crash after the 2nd checkpoint completes
-- but before the physical truncate 'mdtruncate()'.
delete from t;
-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate. If the physical
-- truncate happens the BLK_NOTFOUND will be returned and the Delete REDO skipped.
vacuum t;
--------------------------------------------------------

> On November 10, 2019 at 11:51 PM Michael Paquier < michael(at)paquier(dot)xyz mailto:michael(at)paquier(dot)xyz > wrote:
>
>
> On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
>
> > > I repro'ed on PG11 and PG10 STABLE but several months old.
> > I looked at 6d05086 but it doesn't address the core issue.
> >
> > DropRelFileNodeBuffers prevents the checkpoint from writing all
> > needed dirty pages for any REDO's that exist BEFORE the truncate.
> > If we crash after a checkpoint but before the physical truncate then
> > the REDO will need to replay the operation against the dirty page
> > that the Drop invalidated.
> >
> > > I am beginning to look at this thread more seriously, and I'd like to
> first try to reproduce that by myself. Could you share the steps you
> used to do that? This includes any manual sleep calls you may have
> added, the timing of the crash, manual checkpoints, debugger
> breakpoints, etc. It may be possible to extract some more generic
> test from that.
> --
> Michael
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-11-13 02:38:12 Re: [HACKERS] Block level parallel vacuum
Previous Message btkimurayuzk 2019-11-13 01:55:36 Re: Add SQL function to show total block numbers in the relation