Re: [PATCHES] Infrastructure changes for recovery

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: List pgsql-patches <pgsql-patches(at)postgresql(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Infrastructure changes for recovery
Date: 2008-09-26 10:20:15
Message-ID: 1222424415.4445.917.camel@ebony.2ndQuadrant


On Thu, 2008-09-25 at 18:28 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > Version 7

> Anyway, that's sufficiently bad that I'm bouncing the patch for
> reconsideration.

No problem, I understand this needs discussion.

There's less to this than first appears: there are a few basic points
to consider, from which all else follows.

> After reading this for awhile, I realized that there is a rather
> fundamental problem with it: it switches into "consistent recovery"
> mode as soon as it's read WAL beyond ControlFile->minRecoveryPoint.
> In a crash recovery situation that typically is before the last
> checkpoint (if indeed it's not still zero), and what that means is
> that this patch will activate the bgwriter and start letting in
> backends instantaneously after a crash, long before we can have any
> certainty that the DB state really is consistent.
>
> In a normal crash recovery situation this would be easily fixed by
> simply not letting it go to "consistent recovery" state at all, but
> what about recovery from a restartpoint? We don't want a slave that's
> crashed once to never let backends in again. But I don't see how to
> determine that we're far enough past the restartpoint to be consistent
> again. In crash recovery we assume (without proof ;-)) that we're
> consistent once we reach the end of valid-looking WAL, but that rule
> doesn't help for a slave that's following a continuing WAL sequence.
>
> Perhaps something could be done based on noting when we have to pull in
> a WAL segment from the recovery_command, but it sounds like a pretty
> fragile assumption.

Seems like we just say we only signal the postmaster if
InArchiveRecovery. Archive recovery from a restartpoint is still archive
recovery, so this shouldn't be a problem in the way you mention. The
presence of recovery.conf overrides all other cases.
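
For the record, the gating I have in mind is roughly this (a sketch
only; the helper name and the postmaster signal reason are placeholders,
not necessarily what the patch ends up using):

#include "access/xlog.h"
#include "storage/pmsignal.h"

/*
 * Hypothetical helper, called from the startup process once WAL replay
 * has passed minRecoveryPoint.  Only archive recovery (recovery.conf
 * was present) ever signals the postmaster; plain crash recovery never
 * does, so backends cannot be let in before we know we are consistent.
 */
static void
CheckStartRecoveryProcesses(void)
{
    if (InArchiveRecovery)
        SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);    /* reason name assumed */
}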

> Some other issues I noted before giving up:

All of the issues raised can be addressed, but I think the main
decision we need to make is not so much about running other processes
as about when they can start and when they have to change mode.

When they can start seems solvable, as above.

When/how they must change state from recovery to normal mode seems more
difficult. The state change must be atomic across all processes, yet
also visible at a fine-grained level so that XLogFlush can test for it.
The overhead of continually checking seems high, so I am tempted to say
let's just kick 'em all off and then let them back on again. That's
easily accomplished for the bgwriter without anybody noticing much. For
Hot Standby it would mean that a failover kicks off all query backends;
I can see why that would be both desirable and undesirable.

Anyway, from here I propose:
* we keep the shutdown checkpoint
* we kick off the bgwriter (and any children), then let them back on
again so they can initialise in a different mode.

To do that, I just need to dust off a previous version of the patch. So
we can sort this out quickly if we have a clear way to proceed.

------------------------------------------------------------------
Other comments relate to the current patch as posted, so further
discussion of the points below may not be required if we agree how to
proceed as above.

> * I'm a bit uncomfortable with the fact that the
> IsRecoveryProcessingMode flag is read and written with no lock.
> There's no atomicity or concurrent-write problem, sure, but on
> a multiprocessor with weak memory ordering guarantees (eg PPC)
> readers could get a fairly stale value of the flag. The false
> to true transition happens before anyone except the startup process is
> running, so that's no problem; the net effect is then that backends
> might think that recovery mode was still active for awhile after it
> wasn't. This seems a bit scary, eg in the patch as it stands that'll
> disable XLogFlush calls that should have happened. You could fix that
> by grabbing/releasing some spinlock (any old one) around the accesses,
> but are any of the call sites performance-critical? The one in
> XLogInsert looks like it is, if nothing else.

Agreed.

It's not a dynamic state, so I can fix that inside
IsRecoveryProcessingMode() by caching it in a process-local variable to
make the check faster:

static bool LocalRecoveryProcessingMode = true;

bool
IsRecoveryProcessingMode(void)
{
    /*
     * Once the shared flag has been seen to be false it can never become
     * true again in this process's lifetime, so the cached copy lets us
     * skip the spinlock on all later calls.
     */
    if (!LocalRecoveryProcessingMode)
        return false;

    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;

        SpinLockAcquire(&xlogctl->mode_lck);
        LocalRecoveryProcessingMode = xlogctl->IsRecoveryProcessingMode;
        SpinLockRelease(&xlogctl->mode_lck);
    }

    return LocalRecoveryProcessingMode;
}
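
With the local flag, a backend takes the spinlock only while the shared
flag is still true; once recovery ends, each process pays for the lock
at most once more before its cached copy goes false and stays false.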

This only applies if we decide not to kick everybody off, change state
and then let them back on again.

> * I kinda think you broke XLogFlush anyway. It's certainly clear
> that the WARNING case at the bottom is unreachable with the patch,
> and I think that means that you've messed up an important error
> robustness behavior. Is it still possible to get out of recovery mode
> if there are any bad LSNs in the shared buffer pool?

Perhaps. But the WARNING could only occur during shutdown checkpoints,
which this patch specifically avoids, so the case would never arise
with this patch. Yes, you can still leave recovery mode if there are
bad LSNs with this patch, but you won't know what they are because of
the missing shutdown checkpoint. That's probably an argument in favour
of keeping shutdown checkpoints.
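
To make that concrete, the end of XLogFlush() looks roughly like this
with the patch's early return in place (a simplified sketch, not the
actual hunk; the write/fsync logic in the middle is elided):

void
XLogFlush(XLogRecPtr record)
{
    /*
     * In recovery we generated none of this WAL ourselves, so there is
     * nothing of ours to flush.  This early return is what makes the
     * WARNING branch below unreachable while recovery mode is active.
     */
    if (IsRecoveryProcessingMode())
        return;

    /* ... normal path: compute write request, take WALWriteLock, write/fsync ... */

    /*
     * If we still could not flush up to the requested point, some page
     * carries an LSN past the end of WAL (the "bad LSN" case).  The
     * WARNING, rather than ERROR, severity only applies while InRecovery
     * is true, i.e. at the end-of-recovery shutdown checkpoint.
     */
    if (XLByteLT(LogwrtResult.Flush, record))
        elog(InRecovery ? WARNING : ERROR,
             "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
             record.xlogid, record.xrecoff,
             LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
}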

> * The use of InRecovery in CreateCheckPoint seems pretty bogus, since
> that function can be called from the bgwriter in which the flag will
> never be true. Either this needs to be IsRecoveryProcessingMode(),
> or it's useless because we'll never create ordinary checkpoints until
> after subtrans.c is up anyway.

Exactly. The bgwriter never needs this to be set because it writes
restartpoints before this point, using a different code path.
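
In other words, something like the following dispatch in the bgwriter's
checkpoint handling (sketch only; ProcessCheckpointRequest is a made-up
wrapper name and CreateRestartPoint() is assumed as the restartpoint
entry point):

static void
ProcessCheckpointRequest(int flags)
{
    if (IsRecoveryProcessingMode())
        CreateRestartPoint(flags);    /* recovery: flush dirty pages, write no WAL */
    else
        CreateCheckPoint(flags);      /* normal running: full checkpoint */
}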

> * The patch moves the clearing of InRecovery from after to before
> StartupCLOG, RecoverPreparedTransactions, etc. Is that really a
> good idea? It makes it hard for those modules to know if they are
> coming up after a normal or recovery startup. I think they may not
> care at the moment, but I'd leave that alone without a darn good
> reason to change it.

I didn't move this in the way you describe; it was already cleared
before StartupCLOG.

I only moved it into exitRecovery(), so it is unset in the same way
exitArchiveRecovery() sets InArchiveRecovery to false. So it's a
refactoring, with no change of sequencing.

> * The comment about CheckpointLock being only pro forma is now wrong,
> if you are going to use locking it to implement a delay in exitRecovery.
> But I don't understand why the delay there. If it's needed, seems like
> the path where a restartpoint *isn't* in progress is wrong --- don't you
> actually need to start one and wait for it?

All of this ducking and diving is because of the bgwriter needing to
perform a state change. That's the ball to keep our eye on.

After much thrashing, I decided that interrupting a restartpoint is too
dangerous a thing to want to do. If we're in the middle of one, we
finish it; if not, there's nothing to interrupt.

> Also I note that if the
> LWLockConditionalAcquire succeeds, you neglect to release the lock,
> which surely can't be right.

Doh. Thanks.

> * The tail end of StartupXLOG() looks pretty unsafe to me. Surely
> we mustn't clear IsRecoveryProcessingMode until after we have
> established the safe-recovery checkpoint. The comments there seem to
> be only vaguely related to the current state of the patch, too.

The whole point was to remove the shutdown checkpoint, but it sounds
like you're not keen on that any more.

I'm neutral on this point: I can see why people would want it removed -
it will speed up failover. I can see why people would want it kept -
there is a slight window where if we crash we will need to go back to
archive recovery.

I had a workable solution that kept it, so will revert to it.

> * Logging of recoveryLastXTime seems pretty bogus now. It's supposed to
> be associated with a restartpoint completion report, but now it's just
> out in the ether somewhere and doesn't represent a guarantee that we're
> synchronized that far.

That last one was there before so we knew where the log ended. It was
never supposed to be associated with a restartpoint, just with the end
of log, purely for information (by user request, for when a log file is
corrupted and we need to know "when" we are up to).

> * backup.sgml needs to be updated to say that log_restartpoints is
> obsolete.

Yes

> * There are a bunch of disturbing assumptions in the SLRU-related
> modules about their StartUp calls being executed without any concurrent
> access. This isn't really a problem that needs to be dealt with in this
> patch, I think, but that will all have to be cleaned up before we dare
> allow any backends to run concurrently with recovery.

Well spotted, thanks.

> btree's
> suppression of relcache invals for metapage updates will be a problem
> too.

Again, thanks. This patch is stand-alone from the later work; that's why.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
