crash recovery vs partially written WAL

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: crash recovery vs partially written WAL
Date: 2020-12-30 20:52:46
Message-ID: 20201230205246.7pb6ekq63faazrr6@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

A question from a colleague made me wonder if there are scenarios where
two subsequent crashes could lead to the wrong WAL being applied.

Imagine the following scenario:
[ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
^flush ^write ^insert

If the machine crashes at this moment, we could end up with a situation
where pages 1, 3, and 4 made it out to disk, but page 2 didn't.

That in itself is not a problem: when we perform crash recovery, we'll
detect the end of WAL. We'll zero out the invalid parts of page 2 and
log an end-of-recovery checkpoint (which has to fit either onto page 2 or
3).

What I am concerned about is what happens if after crash recovery we
fill up page 3 with new valid records, ending exactly at the page
boundary (i.e. .

[ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
^(flush,write)
^insert

If we crash now, we'll perform recovery from the end-of-recovery record
somewhere on page 2 or 3, and replay the rest of page 3.

That's where I see/wonder about a problem: What guarantees that we find
the contents of xlog page 4 to be invalid? The page header will have the
appropriate xlp_pageaddr/tli/info, and because the last record on page 3
ended precisely at the page boundary, there won't be an xlp_rem_len
allowing us to detect this either.
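
To make the concern concrete, here is a toy model of a page-header check.
It is only a sketch: PageHeader and header_looks_valid are hypothetical
names, the fields mirror only the ones named above, and any record-level
validation is deliberately left out.

```python
from dataclasses import dataclass

XLOG_BLCKSZ = 8192  # default WAL page size


@dataclass
class PageHeader:
    # Toy stand-in for the header fields named above; not the real layout.
    xlp_pageaddr: int  # WAL address this page claims to live at
    xlp_tli: int       # timeline ID
    xlp_rem_len: int   # bytes of a record continued from the previous page


def header_looks_valid(hdr, expected_addr, expected_tli, prev_record_spills):
    """Hypothetical header-only check; ignores record-level validation."""
    if hdr.xlp_pageaddr != expected_addr:
        return False
    if hdr.xlp_tli != expected_tli:
        return False
    # The continuation length can only be cross-checked when the previous
    # record actually crosses the page boundary.
    if prev_record_spills != (hdr.xlp_rem_len > 0):
        return False
    return True


# Page 4 as it reached disk before the first crash: right address, same
# timeline, and (assumed here) no continued record at its start.
page4_addr = 3 * XLOG_BLCKSZ
stale_page4 = PageHeader(xlp_pageaddr=page4_addr, xlp_tli=1, xlp_rem_len=0)

# After recovery, the new records on page 3 end exactly at the boundary,
# so nothing spills over and the stale header passes the check.
accepted = header_looks_valid(stale_page4, page4_addr, 1,
                              prev_record_spills=False)
```

In this model nothing in the header distinguishes the stale page 4 from a
freshly written one.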

While we zero out WAL pages in-memory before using them, this won't help
in this instance because a) nothing was inserted into page 4 and b) page 4
was never written out.

WAL segment recycling doesn't cause similar problems because xlp_pageaddr
protects us against related issues.
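
A small sketch of why recycling differs from the in-place case, assuming
the default 16MB segment and 8kB page sizes. page_addr and
pageaddr_matches are hypothetical helpers for illustration, not
PostgreSQL functions.

```python
XLOG_BLCKSZ = 8192                   # default WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size


def page_addr(segno, page_in_seg):
    """WAL byte address of a page, given segment number and page index."""
    return segno * WAL_SEGMENT_SIZE + page_in_seg * XLOG_BLCKSZ


def pageaddr_matches(stored_addr, segno, page_in_seg):
    """True if the address stored in a page header fits where we read it."""
    return stored_addr == page_addr(segno, page_in_seg)


# Segment 5 gets recycled as segment 9: its pages still carry segment-5
# addresses, so the stale contents are rejected.
recycled_ok = pageaddr_matches(page_addr(5, 0), segno=9, page_in_seg=0)

# A stale page left in place from before the crash carries the address of
# the very position it is read from, so this check cannot reject it.
in_place_ok = pageaddr_matches(page_addr(9, 3), segno=9, page_in_seg=3)
```

Here recycled_ok is False and in_place_ok is True, which is the asymmetry
described above.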

Replaying the old records from page 4 is obviously wrong, since they may
rely on modifications the "old" records on page 2/3 would have performed
(but which got lost).

I don't immediately see a good fix for this. The most obvious thing
would be to explicitly zero-out all WAL files beyond the end-of-recovery
point that have a "correct" xlp_pageaddr, but that may require reading a
lot of WAL due to WAL file recycling.
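
A rough in-memory sketch of that idea, under loud assumptions: the page
layout (an 8-byte address at offset 0), the tiny page size, and the names
make_page and zero_stale_pages are all made up for illustration. Note
that it has to read every page past the end-of-recovery point, which is
exactly the cost mentioned above.

```python
import struct

PAGE_SIZE = 64    # tiny toy page, not the real XLOG_BLCKSZ
ADDR_FMT = "<Q"   # assumed layout: 8-byte page address at offset 0


def make_page(addr, payload=b"old record"):
    """Build a toy page whose header claims to live at `addr`."""
    page = bytearray(PAGE_SIZE)
    struct.pack_into(ADDR_FMT, page, 0, addr)
    page[8:8 + len(payload)] = payload
    return page


def zero_stale_pages(pages, first_addr, end_of_recovery):
    """Zero pages at or beyond end_of_recovery whose stored address is
    'correct' for their position. Returns the number of pages zeroed."""
    zeroed = 0
    for i, page in enumerate(pages):
        addr = first_addr + i * PAGE_SIZE
        if addr < end_of_recovery:
            continue  # still-valid WAL, leave it alone
        (stored,) = struct.unpack_from(ADDR_FMT, page, 0)
        if stored == addr:
            page[:] = bytes(PAGE_SIZE)  # overwrite in place with zeroes
            zeroed += 1
    return zeroed


first_addr = 10 * PAGE_SIZE
# Four pages written in place, followed by one page from a recycled file
# that still carries an unrelated old address.
pages = [make_page(first_addr + i * PAGE_SIZE) for i in range(4)]
pages.append(make_page(2 * PAGE_SIZE))

# Suppose valid WAL ends exactly at the start of the fourth page.
count = zero_stale_pages(pages, first_addr, first_addr + 3 * PAGE_SIZE)
```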

I hope I am missing some crosscheck making this a non-issue?

Greetings,

Andres Freund
