Re: 'Invalid lp' during heap_xlog_delete

From: Andres Freund <andres(at)anarazel(dot)de>
To: Daniel Wood <hexexpert(at)comcast(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 'Invalid lp' during heap_xlog_delete
Date: 2019-12-06 23:06:40
Message-ID: 20191206230640.2dvdjpcgn46q3ks2@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-11-08 12:46:51 -0800, Daniel Wood wrote:
> Page on disk has empty lp 1
> * Insert into page lp 1
>
> checkpoint START. Redo eventually starts here.
> ** Delete all rows on page.

It's worthwhile to note that this part cannot happen without full page
writes disabled. By dint of a checkpoint having stared previously, this
will otherwise always include an FPW (or be marked as WILL_INIT by a
previous record, which functionally is equivalent).

> autovac truncate
> DropRelFileNodeBuffers - dirty page NOT written. lp 1 on disk still empty
> checkpoint completes

If I understand correctly, the DropRelFileNodeBuffers() needs to happen
before the BufferSync() reaches the buffer containing the page with the
deletion, but before the relevant file(s) are truncated. And obviously
the deletion needs to have finished modifying the page in question. Not
a conflict with what you wrote, just confirming.

> crash
> smgrtruncate - Not reached

This seems like a somewhat confusing description to me, because
smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what you
mean by "smgrtruncate" is not the function, but the smgr_truncate()
callback?

> Even if we reach the truncate we don't fsync it till the next
> checkpoint. So on filesystems which delay metadata updates a crash
> can lose the truncate.

I think that's probably fine though. Leaving the issue of checkpoint
completing inbetween the DropRelFileNodeBuffers() and the actual
truncation aside, we'd have the WAL logged truncation truncating the
file. I don't think it's unreasonable to except a filesystem that claims
to support running without full_page_writes (I've seen several such
claims turning out not to be true under load), to preserve either the
original page contents or the new file size after a a crash. If your
filesystem doesn't, you really ought not to use it with FPW = off.

I do wonder, halfway related, if there's an argument that
XLogReadBufferForRedoExtended() ought to return something other than
BLK_NEEDS_REDO for pages read during recovery that are all-zeroes, at
least for some RBM_* modes.

> Once we do the fsync(), for the truncate, the REDO read will return
> BLK_NOTFOUND and the DELETE REDO attempt will be skipped. WIthout the
> fsync() or crashing before the truncate, the delete redo depends on
> the most recent version of the page having been written by the
> checkpoint.

We also need to correctly replay this on a standby, so I don't think
just adding an smgrimmedsync() is the answer. We'd not replay the
truncation standbys / during PITR, unless I miss something. So we'd just
end up with the same problem in slightly different situations.

> Is DropRelFileNodeBuffers purely for performance or would there be any
> correctness problems if not done.

There would be correctness problems if we left that out - the on-disk
state and the in-memory state would diverge.

To me it sounds the fix here would be to rejigger the RelationTruncate()
sequence for truncation of the main for as follows:

1) MyPgXact->delayChkpt = true
2) XLogInsert(XLOG_SMGR_TRUNCATE)
3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(),
and then calls the smgr_truncate callback)
4) MyPgXact->delayChkpt = false

I'm not worried about the increased delayChkpt = true time. Compared
with the frequency of RecordTransactionCommit() this seems harmless.

I'm inclined to think that we should make the XLogFlush() in the
RelationNeedsWAL() branch of RelationTruncate()
unconditional. Performing the truncation on the filesystem level without
actually having persisted the corresponding WAL record is dangerous.

I think we need to backpatch a fix for this (even if one were to
consider fpw = off unsupported). I think there's enough other nasty edge
cases here. While fpw=on fixes the deletion case at hand, I think you
could very well end up in a nasty situation in other cases where either
redo location or the actual checkpoint record would fall between the WAL
record and the actual truncation. Imagine e.g. a base backup starting in
such a situation - you'd potentially end up with a relation that
contains old data, without later replaying the truncation record. The
window isn't huge, but also not negligible.

I'll start a separate thread about whether we need to do a good bit of
the work in smgrtruncate() / smgrdounlink() / ... in critical sections.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2019-12-06 23:18:14 RE: [Proposal] Level4 Warnings show many shadow vars
Previous Message Andrew Dunstan 2019-12-06 22:21:58 Re: ssl passphrase callback