Re: 'Invalid lp' during heap_xlog_delete

From: Daniel Wood <hexexpert(at)comcast(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 'Invalid lp' during heap_xlog_delete
Date: 2019-12-11 02:24:46
Message-ID: 1780069540.557505.1576031087353@connect.xfinity.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On December 6, 2019 at 3:06 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
...
> > crash
> > smgrtruncate - Not reached
>
> This seems like a somewhat confusing description to me, because
> smgrtruncate() is what calls DropRelFileNodeBuffers(). I assume what you
> mean by "smgrtruncate" is not the function, but the smgr_truncate()
> callback?

My mistake. Yes, smgr_truncate()

> > Even if we reach the truncate we don't fsync it till the next
> > checkpoint. So on filesystems which delay metadata updates a crash
> > can lose the truncate.
>
> I think that's probably fine though. Leaving the issue of checkpoint
> completing inbetween the DropRelFileNodeBuffers() and the actual
> truncation aside, we'd have the WAL logged truncation truncating the
> file. I don't think it's unreasonable to except a filesystem that claims
> to support running without full_page_writes (I've seen several such
> claims turning out not to be true under load), to preserve either the
> original page contents or the new file size after a a crash. If your
> filesystem doesn't, you really ought not to use it with FPW = off.

If the phsyical truncate doesn't occur in the seconds after the cache invalidation
nor the fsync within the minutes till the next checkpoint you are NOT left
with a torn page on disk. You are left with the 'incorrect' page on disk.
In other words, an older page because the invalidation prevent the write
of the most recent dirty page. Redos don't like old incorrect pages.
But, yes, fullpage writes covers up this anomaly(To be generous).

> > Once we do the fsync(), for the truncate, the REDO read will return
> > BLK_NOTFOUND and the DELETE REDO attempt will be skipped. WIthout the
> > fsync() or crashing before the truncate, the delete redo depends on
> > the most recent version of the page having been written by the
> > checkpoint.
>
> We also need to correctly replay this on a standby, so I don't think
> just adding an smgrimmedsync() is the answer. We'd not replay the
> truncation standbys / during PITR, unless I miss something. So we'd just
> end up with the same problem in slightly different situations.

I haven't mentioned to you that we have seen what appears to be the same
problem during PITR's depending on which base backup we start with. I didn't
mention it because I haven't created a repro to prove it. I simply suspect it.

> To me it sounds the fix here would be to rejigger the RelationTruncate()
> sequence for truncation of the main for as follows:
>
> 1) MyPgXact->delayChkpt = true
> 2) XLogInsert(XLOG_SMGR_TRUNCATE)
> 3) smgrtruncate() (which, as now, first does a DropRelFileNodeBuffers(),
> and then calls the smgr_truncate callback)
> 4) MyPgXact->delayChkpt = false
>
> I'm not worried about the increased delayChkpt = true time. Compared
> with the frequency of RecordTransactionCommit() this seems harmless.

Seems reasonable.

> I'm inclined to think that we should make the XLogFlush() in the
> RelationNeedsWAL() branch of RelationTruncate()
> unconditional. Performing the truncation on the filesystem level without
> actually having persisted the corresponding WAL record is dangerous.

Yes, I also was curious about why it was conditional.

- Dan Wood

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-12-11 03:11:14 Re: Wrong assert in TransactionGroupUpdateXidStatus
Previous Message Thomas Munro 2019-12-11 02:22:41 Re: stress test for parallel workers