Re: Lots of FSM-related fragility in transaction commit

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Lots of FSM-related fragility in transaction commit
Date: 2011-12-08 08:29:40
Message-ID: 4EE07574.5020309@enterprisedb.com
Lists: pgsql-hackers

On 08.12.2011 08:20, Tom Lane wrote:
> I thought that removing the unreadable file would let me restart,
> but after "rm 52860_fsm" and trying again to start the server,
> there's a different problem:
>
> LOG: database system was interrupted while in recovery at 2011-12-08 00:56:11 EST
> HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
> LOG: database system was not properly shut down; automatic recovery in progress
> LOG: consistent recovery state reached at 0/5D71BA8
> LOG: redo starts at 0/5100050
> WARNING: page 18 of relation base/27666/52860 is uninitialized
> CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
> PANIC: WAL contains references to invalid pages
> CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
> LOG: startup process (PID 14507) was terminated by signal 6
> LOG: aborting startup due to startup process failure
>
> Note that this isn't even the same symptom Shraibman hit, since
> apparently he was failing on replaying the commit record. How is it
> that the main table file managed to have uninitialized pages?
> I suspect that "redo visible" WAL processing is breaking one or more of
> the fundamental WAL rules, perhaps by not generating a full-page image
> when it needs to.
>
> So this is really a whole lot worse than our behavior was in pre-FSM
> days, and it needs to get fixed.

This bug was actually introduced only recently. Notice how the log says
"consistent recovery state reached at 0/5D71BA8". This interacts badly
with Fujii's patch I committed last week:

> commit 1e616f639156b2431725f7823c999486ca46c1ea
> Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
> Date: Fri Dec 2 10:49:54 2011 +0200
>
> During recovery, if we reach consistent state and still have entries in the
> invalid-page hash table, PANIC immediately. Immediate PANIC is much better
> than waiting for end-of-recovery, which is what we did before, because the
> end-of-recovery might not come until months later if this is a standby
> server.
> ...

Recovery thinks it has reached consistency early on, so it PANICs as
soon as it sees a reference to a page that belongs to a table that was
later dropped.

The bug here is that we treat consistency as reached immediately
during crash recovery. That's a false notion: during crash recovery the
database isn't consistent until *all* WAL has been replayed. We should
not set reachedMinRecoveryPoint during crash recovery at all. And
perhaps the flag should be renamed to reachedConsistency, to make it
clear that during crash recovery it's not enough to reach the nominal
minRecoveryPoint.

That was harmless until last week, because reachedMinRecoveryPoint was
not used for anything unless you were doing archive recovery with hot
standby enabled, but IMO the "consistent recovery state reached" log
message was misleading even then. I propose that we apply the attached
patch to master and the back branches.
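
To illustrate the rule described above, here is a minimal standalone sketch. It is not the attached patch and not PostgreSQL source: the struct, its fields, and both function names are hypothetical, and WAL positions are simplified to plain integers. It only models the idea that the consistency flag is never set during crash recovery, so a reference to an invalid page cannot trigger an immediate PANIC there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; WAL positions are plain integers here. */
typedef uint64_t WalPosition;

typedef struct RecoveryState
{
    bool        in_archive_recovery;  /* false means plain crash recovery */
    WalPosition min_recovery_point;   /* the nominal minRecoveryPoint */
    WalPosition replayed_upto;        /* end of the last WAL record replayed */
    bool        reached_consistency;  /* the flag proposed as reachedConsistency */
} RecoveryState;

/*
 * Intended to be called after each WAL record is replayed (assumption).
 * In crash recovery the flag is never set, because the database is only
 * consistent once all WAL has been replayed.
 */
static void
check_consistency(RecoveryState *state)
{
    assert(state != NULL);

    if (state->reached_consistency)
        return;                 /* already marked consistent */

    if (!state->in_archive_recovery)
        return;                 /* crash recovery: do not set the flag */

    if (state->replayed_upto >= state->min_recovery_point)
        state->reached_consistency = true;
}

/*
 * An invalid-page reference is fatal right away only once consistency has
 * been reached; before that it may belong to a relation dropped later in
 * the WAL stream.
 */
static bool
must_panic_on_invalid_page(const RecoveryState *state)
{
    assert(state != NULL);
    return state->reached_consistency;
}
```

In this model, a crash-recovery run that is already past its minRecoveryPoint still reports must_panic_on_invalid_page() as false, which matches the scenario in the log above: the reference to block 18 would be remembered rather than treated as fatal on the spot.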

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: dont-set-reachedMinRecoveryPoint-in-crash-recovery-1.patch (text/x-diff, 2.7 KB)
