I've been reviewing and massaging the so called recovery infra patch.
To recap, the goal is to:
- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the
database is brought up immediately, and the bgwriter performs a normal
online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where
we could let read-only backends in. Which is obviously required for hot
The 1st and 2nd points provide some useful functionality, even without
the rest of the hot standby patch.
I've refactored the patch quite heavily, making it a lot simpler, and
over 1/3 smaller than before:
The signaling between the bgwriter and startup process during recovery
was quite complicated. The startup process periodically sent checkpoint
records to the bgwriter, so that bgwriter could perform restart points.
I've replaced that by storing the last seen checkpoint in a shared
memory in xlog.c. CreateRestartPoint() picks it up from there. This
means that bgwriter can decide autonomously when to perform a restart
point, it no longer needs to be told to do so by the startup process.
Which is nice in a standby. What could happen before is that the standby
processes a checkpoint record, and decides not to make it a restartpoint
because not enough time has passed since last one. If we then get a long
idle period after that, we'd really want to make the previous checkpoint
record a restart point after all, after some time has passed. That is
what will happen now, which is a usability enhancement, although the
real motivation for this refactoring was to make the code simpler.
The bgwriter is now always responsible for all checkpoints and
restartpoints. (well, except for a stand-alone backend). Which makes it
easier to understand what's going on, IMHO.
There was one pretty fundamental bug in the minsafestartpoint handling:
it was always set when a WAL file was opened for reading. Which means it
was also moved backwards when the recovery began by reading the WAL
segment containing last restart/checkpoint, rendering it useless for the
purpose it was designed. Fortunately that was easy to fix. Another tiny
bug was that log_restartpoints was not respected, because it was stored
in a variable in startup process' memory, and wasn't seen by bgwriter.
One aspect that troubles me a bit is the changes in XLogFlush. I guess
we no longer have the problem that you can't start up the database if
we've read in a corrupted page from disk, because we now start up before
checkpointing. However, it does mean that if a corrupt page is read into
shared buffers, we'll never be able to checkpoint. But then again, I
guess that's already true without this patch.
I feel quite good about this patch now. Given the amount of code churn,
it requires testing, and I'll read it through one more time after
sleeping over it. Simon, do you see anything wrong with this?
(this patch is also in my git repository at git.postgresql.org, branch
pgsql-hackers by date
|Next:||From: Peter Eisentraut||Date: 2009-01-28 10:18:16|
|Subject: How to get SE-PostgreSQL acceptable|
|Previous:||From: Zdenek Kotala||Date: 2009-01-28 09:28:27|
|Subject: Re: pg_upgrade project status|