Re: Point in Time Recovery

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: mascarm(at)mascari(dot)com, ZeugswetterA(at)spardat(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Point in Time Recovery
Date: 2004-07-13 11:38:23
Message-ID: 1089718702.17493.2527.camel@stromboli
Lists: pgsql-admin pgsql-hackers pgsql-patches

On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:
> On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>
> > > - when we stop, keep reading records until EOF, just don't apply them.
> > > When we write a checkpoint at end of recovery, the unapplied
> > > transactions are buried alive, never to return.
> > > - stop where we stop, then force zeros to EOF, so that no possible
> > > record remains of previous transactions.
> >
> > Go with plan B; it's best not to destroy data (what if you chose the
> > wrong restart point the first time)?
> >
> > Actually this now reminds me of a discussion I had with Patrick
> > Macdonald some time ago. The DB2 practice in this connection is that
> > you *never* overwrite existing logfile data when recovering. Instead
> > you start a brand new xlog segment file, which is given a new "branch
> > number" so it can be distinguished from the future-time xlog segments
> > that you chose not to apply. I don't recall what the DB2 terminology
> > was exactly --- not "branch number" I don't think --- but anyway the
> > idea is that when you restart the database after an incomplete recovery,
> > you are now in a sort of parallel universe that has its own history
> > after the branch point (PITR stop point). You need to be able to
> > distinguish archived log segments of this parallel universe from those
> > of previous and subsequent incarnations. I'm not sure whether Vadim
> > intended our StartUpID to serve this purpose, but it could perhaps be
> > used that way, if we reflected it in the WAL file names.
> >
>
> Some more thoughts...focusing on what we do after we've finished
> recovering. The objectives, as I see them, are to put the system into a
> state that preserves these features:
> 1. we never overwrite files, in case we want to re-run recovery
> 2. we never write files that MIGHT have been written previously
> 3. we need to ensure that any xlog records skipped at the admin's request (in
> PITR mode) are never in a position to be re-applied to this timeline.
> 4. ensure we can re-recover, if we need to, without further problems
>
> Tom's concept above, I'm going to call timelines. A timeline is the
> sequence of logs created by the execution of a server. If you recover
> the database, you create a new timeline. [This is because, if you've
> invoked PITR you absolutely definitely want log records written to, say,
> xlog15 to be different to those that were written to xlog15 in a
> previous timeline that you have chosen not to reapply.]
>
> Objective (1) is complex.
> When we are restoring, we always start with archived copies of the xlog,
> to make sure we don't finish too soon. We roll forward until we either
> reach PITR stop point, or we hit end of archived logs. If we hit end of
> logs on archive, then we switch to a local copy, if one exists that is
> higher than those, we carry on rolling forward until either we reach
> PITR stop point, or we hit end of that log. (Hopefully, there isn't more
> than one local xlog higher than the archive, but it's possible).
> If we are rolling forward on local copies, then they are our only
> copies. We'd really like to archive them ASAP, but the archiver's not
> running yet - we don't want to force that situation in case the archive
> device (say a tape) is the one being used to recover right now. So we
> write an archive_status of .ready for that file, ensuring that the
> checkpoint won't remove it until it gets copied to archive, whenever
> that starts working again. Objective (1) met.
>
> When we have finished recovering we:
> - create a new xlog at the start of a new ++timeline
> - copy the last applied xlog record to it as the first record
> - set the record pointer so that it matches
> That way, when we come up and begin running, we never overwrite files
> that might have been written previously. Objective (2) met.
> We do the other stuff because recovery finishes up by pointing to the
> last applied record...which is what was causing all of this extra work
> in the first place.
>
> At this point, we also reset the secondary checkpoint record, so that
> should recovery be required again before next checkpoint AND the
> shutdown checkpoint record written after recovery completes is
> wrong/damaged, the recovery will not autorewind back past the PITR stop
> point and attempt to recover the records we have just tried so hard to
> reverse/ignore. Objective (3) met. (Clearly, that situation seems
> unlikely, but I feel we must deal with it...a newly restored system is
> actually very fragile, so a crash again within 3 minutes or so is very
> commonplace, as far as these things go).
>
> Should we need to re-recover, we can do so because the new timeline
> xlogs are further forward than the old timeline, so never get seen by
> any processes (all of which look backwards). Re-recovery is possible
> without problems, if required. This means you're a lot safer from some
> of the mistakes you might have made, such as deciding you need to go into
> recovery, then realising it wasn't required (or some other painful
> flapping as goes on in computer rooms at 3am).
>
> How do we implement timelines?
> The main presumption in the code is that xlogs are sequential. That has
> two effects:
> 1. during recovery, we try to open the "next" xlog by adding one to the
> numbers and then looking for that file
> 2. during checkpoint, we look for filenames less than the current
> checkpoint marker
> Creating a timeline by adding a larger number to LogId allows us to
> prevent (1) from working, yet without breaking (2).
> Well, Tom does seem to have something with regard to StartUpIds. I feel
> it is easier to force a new timeline by adding a very large number to
> the LogId IF, and only if, we have performed an archive recovery. That
> way, we do not change at all the behaviour of the system for people that
> choose not to implement archive_mode.
>
> Should we implement timelines?
> Yes, I think we should. I've already hit the problems that timelines
> solve in my testing and so that means they'll be hit when you don't need
> the hassle.
>
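To make the quoted LogId scheme concrete, here is a rough sketch in Python rather than backend C. It is a toy model, not the actual xlog code: a segment is just a (log_id, seg) pair, and TIMELINE_JUMP, SEGS_PER_LOG and all the function names are made up for illustration. It only shows why a large jump in LogId stops the "add one" lookup from wandering into skipped files while the "less than the checkpoint marker" test still behaves.

```python
# Toy model: an xlog segment is identified by (log_id, seg) and files sort
# by that pair. TIMELINE_JUMP and SEGS_PER_LOG are illustrative assumptions.
TIMELINE_JUMP = 0x01000000   # hypothetical "very large number" added to LogId
SEGS_PER_LOG = 0xFF          # hypothetical number of segments per log id


def file_name(log_id, seg):
    """Render a segment name as two 8-digit hex fields (assumed layout)."""
    return f"{log_id:08X}{seg:08X}"


def next_segment(log_id, seg):
    """Presumption (1): recovery finds the next file by adding one."""
    if seg + 1 >= SEGS_PER_LOG:
        return log_id + 1, 0
    return log_id, seg + 1


def start_new_timeline(log_id):
    """After archive recovery, jump far ahead so old names are never reused."""
    return log_id + TIMELINE_JUMP, 0


def removable(files, checkpoint):
    """Presumption (2): checkpoint only considers files below its marker."""
    return [f for f in files if f < checkpoint]


# The old timeline wrote segments 0/13..0/16; recovery stops inside 0/14.
old_files = [(0, 13), (0, 14), (0, 15), (0, 16)]
stop_at = (0, 14)
new_start = start_new_timeline(stop_at[0])

# Following "add one" from the stop point would walk into the skipped 0/15...
walks_into_old = next_segment(*stop_at) in old_files
# ...but the first segment of the new timeline cannot collide with an old file.
collides = new_start in old_files
# Every old file still sorts below the new timeline, so (2) keeps working.
old_below_new = removable(old_files, new_start) == old_files
```

The size of the jump is the only design choice here: it has to be bigger than any number of segments the old timeline could plausibly have written.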

I'm still wrestling with the cleanup-after-stopping-at-point-in-time
code and have some important conclusions.

Moving forward on a timeline is somewhat tricky for xlogs, as shown
above,...but...

My earlier treatment seems to have neglected the clog.
If we stop before end of log, then we also have potentially many (though
presumably at least one) committed transactions that we do not want to
be told about ever again.

The idea of starting a new timeline works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (though
vanishing) possibility that there is an as yet undiscovered committed
transaction in the future. (Transactions are ordered in the clog by xid,
because xids are assigned sequentially at txn start, but they are not
ordered that way in the xlog, where they are recorded in the order the
txns complete.)
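
A small illustration of the ordering mismatch, with made-up xids and a made-up stop point. The list stands in for commit records as they would appear in the xlog; nothing here is taken from real log contents.

```python
# Hypothetical commit records in the order they appear in the xlog, i.e.
# completion order. The xids themselves were assigned at transaction start.
xlog_commits = [101, 103, 100, 105, 102]
stop_after = 2  # assumed PITR stop point: apply only the first two records

applied = xlog_commits[:stop_after]
skipped = xlog_commits[stop_after:]

highest_applied = max(applied)
# xids that commit after the stop point can be lower than ones already applied...
skipped_below = sorted(x for x in skipped if x < highest_applied)
# ...as well as higher, so a cut at one xlog position is not a cut at one xid.
skipped_above = sorted(x for x in skipped if x > highest_applied)
```

In this toy run the skipped commits sit on both sides of the highest applied xid, which is the sense in which a position in the xlog does not map onto a single position in the clog.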

Please tell me that we can ignore the state of the clog, but I think we
can't - if a new xid re-used a previous xid that had committed AND then
we crashed...we would have inconsistent data. Unless we physically write
zeros to clog for every begin transaction after a recovery...err, no...

The only recourse that I can see is to "truncate the future" of the
clog, which would mean:
- keeping track of the highest xid provided by any record from the xlog,
in xact.c, xact_redo
- using that xid to write zeros to the clog after this point until EOF
- drop any clog segment files past the new "high" segment
- no idea how that affects NT or not...
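
As a rough sketch of the first two steps, here is an in-memory model in Python. It assumes two status bits per xid packed four to a byte, with zero meaning "in progress"; the buffer size, the function names and the sample xids are all invented for the example, and dropping whole segment files is not modelled at all.

```python
XACTS_PER_BYTE = 4                      # two status bits per transaction
IN_PROGRESS, COMMITTED = 0b00, 0b01     # assumed status encoding


def get_status(clog, xid):
    """Read the two status bits for xid."""
    shift = (xid % XACTS_PER_BYTE) * 2
    return (clog[xid // XACTS_PER_BYTE] >> shift) & 0b11


def set_status(clog, xid, status):
    """Overwrite the two status bits for xid."""
    shift = (xid % XACTS_PER_BYTE) * 2
    idx = xid // XACTS_PER_BYTE
    cleared = clog[idx] & ~(0b11 << shift) & 0xFF
    clog[idx] = cleared | (status << shift)


def truncate_future(clog, highest_xid):
    """Zero every status after highest_xid, through to the end of the buffer."""
    first = highest_xid + 1
    # Clear the rest of the byte that highest_xid shares with later xids.
    while first % XACTS_PER_BYTE != 0 and first // XACTS_PER_BYTE < len(clog):
        set_status(clog, first, IN_PROGRESS)
        first += 1
    # Zero all remaining whole bytes, keeping the buffer length unchanged.
    start = min(first // XACTS_PER_BYTE, len(clog))
    clog[start:] = bytes(len(clog) - start)


clog = bytearray(8)                 # room for 32 xids in this toy model
for xid in (5, 9, 10, 21):          # made-up committed transactions
    set_status(clog, xid, COMMITTED)

highest_seen = 9                    # hypothetically tracked while replaying xlog
truncate_future(clog, highest_seen)
```

After the call, xids 5 and 9 still read as committed while 10 and 21 read as in progress again, so a new transaction that is handed one of those xids does not inherit a stale commit.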

The timeline idea works for xlog because once we've applied the xlog
records and checkpointed, we can discard the xlog records. We can't do
that with clog records (unless we followed recovery with a vacuum full -
which is possible, but not hugely desirable). Even that wouldn't solve
the underlying issue: xlog records don't have any prescribed position in
the file, but clog records do.

Right now, I don't know where to start with the clog code and the
opportunity for code-overlap with NT seems way high. These problems can
be conquered, given time and "given enough eyeballs".

I'm all ears for some bright ideas...but I'm getting pretty wary that we
may introduce some unintended features if we try to get this stabilised
within two weeks. My current conclusion is: let's commit archive recovery
in this release, then wait until the next dot release for full recovery
target features. We've hit all the features which were a priority and
the fundamental architecture is there, so I think it is time to be happy
with what we've got, for now.

Comments, please....remembering that I'd love it if I've missed
something that simplifies the task. Fire away.

Best regards, Simon Riggs