Re: Why we really need timelines *now* in PITR

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Why we really need timelines *now* in PITR
Date: 2004-07-18 21:10:47
Message-ID: 1090185047.17493.19173.camel@stromboli
Lists: pgsql-hackers

On Sat, 2004-07-17 at 21:36, Tom Lane wrote:
> If we do not add timeline numbers to WAL file names, we will be forced
> to destroy information during recovery. Consider the following
> scenario:
>
> 1. You have a WAL directory containing, say, WAL segments 0010 to 0020
> (for the purposes of this example I won't bother typing out realistic
> 16-digit filenames, but just use 4-digit names).
>
> 2. You discover that your junior DBA messed up badly and you need to
> revert to yesterday evening's state. Let's say the chosen recovery end
> time is in the middle of file 0014.
>
> 3. You run the recovery process. At its end, the WAL end pointer will
> be 0014 and some offset.
>
> If we simply run forward from this situation, then we will be
> overwriting existing WAL records in the existing files 0014-0020.
> This is bad from the point of view of not wanting to discard information
> (what if we decide we should have recovered to a later time??), but
> there is an even more serious reason for not doing that. Suppose we
> suffer a crash sometime after recovery. On restart, the system will
> start replaying the logs, and *there will be nothing to keep it from
> replaying all the way to the end of file 0020*. (The files will contain
> proper, in-sequence page headers, so the tests that normally detect
> recycled log segments won't think there is anything wrong.) This will
> leave you with a thoroughly corrupt database.
>
> One way to solve this would be to physically discard 0015-0020 as soon
> as we decide we're stopping short of the end of WAL. I think that is
> unacceptable on don't-throw-away-information grounds. I think it would
> be far better to invent the timeline concept. Then, our old WAL files
> would be named say 0001.0010 through 0001.0020, and we would start
> logging into 0002.0014 after recovery.
>
> A slightly tricky point is that we have to "sew together" the end of one
> timeline and the start of the next --- for instance, we only want the
> front part of 0001.0014, not the back part, to be part of the new
> timeline. Patrick Macdonald told me about a pretty baroque scheme that
> DB2 uses for this, but I think it would be simplest if we just copied
> the appropriate amount of data from 0001.0014 into 0002.0014 and then
> ran forward from there. Copying a max of 16MB of data doesn't sound
> very onerous.
>

Well, yes - I completely agree that we need the timeline concept as one
of the highest priorities. I originally raised the problem timelines
solve because of the errors I had experienced re-running restores many
times with the same archive set. It's just too easy to overwrite log
files without the timeline concept.

IMHO you don't need to change the xlog format as a necessary step to
introduce timelines. Simply adding FFFF to the logid is sufficient
(which, let's face it, takes a heck of a long time before it gets to 1...)

[Also, as an extra detail on your analysis, when recovery is finished
you need to move both primary and secondary checkpoint markers forwards
to the new timeline, so that crash recovery can't go back to the old
timeline]

If you're going to change xlog filenames, then I would think that adding
the system identifier to the xlogs would be a very good addition. I
would simply have recommended keeping them in separate directories, but
putting it on the name would be best. PostgreSQL doesn't have a name
concept...which would be the thing to use if it did.

> During WAL replay or recovery, there would be a notion of the "target
> timeline" that you are trying to recover to a point within. The rule
> for selecting which WAL segment file to read is "use the one with
> largest timeline number less than or equal to the target, and never less
> than the timeline number you used for the previous segment". So for
> example if we realized we'd chosen the wrong recovery target time, we
> could backpedal and redo the same recovery process with target timeline
> 0001, ignoring any WAL segments that had been archived with timeline
> 0002. Alternatively, if we were simply doing crash recovery in timeline
> 0002, we could stop at (say) segment 0002.0018, and we'd know that we
> should ignore 0001.0019 because it is not in our timeline.
>

That sounds like the way it should work.
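
To check my reading of the selection rule quoted above, here is a rough
sketch in Python. The function names and the in-memory set of (timeline,
segment) pairs are my own assumptions for illustration, not anything in the
xlog code, and segment numbers are plain integers as in the 4-digit example.

```python
def choose_segment(available, seg_no, target_tli, prev_tli):
    """Pick the (timeline, segment) file to read for seg_no, or None.

    available: set of (timeline, segment) pairs present in the archive.
    Rule: largest timeline <= target, and never below the timeline
    used for the previous segment.
    """
    candidates = [
        tli for (tli, seg) in available
        if seg == seg_no and prev_tli <= tli <= target_tli
    ]
    return (max(candidates), seg_no) if candidates else None


def replay_path(available, start_seg, start_tli, target_tli):
    """List the segment files read, in order, until none qualifies."""
    path = []
    tli, seg = start_tli, start_seg
    while True:
        chosen = choose_segment(available, seg, target_tli, tli)
        if chosen is None:
            return path
        path.append(chosen)
        tli, seg = chosen[0], seg + 1


# The quoted example: timeline 1 holds 0010-0020, timeline 2 holds 0014-0018.
archive = {(1, s) for s in range(10, 21)} | {(2, s) for s in range(14, 19)}
```

With target timeline 2, replay_path(archive, 10, 1, 2) switches from
timeline 1 to timeline 2 at segment 14 and stops after 0002.0018, ignoring
0001.0019. With target timeline 1 it reads 0001.0010 through 0001.0020.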

The way you write this makes me think you might mean you would allow
this: we start recovering in one timeline, then roll-forward takes us
through all the timeline nexus points required to reach the target timeline.

I had imagined that recovery would only ever be allowed to start and end
on the same timeline. I think you probably mean that?

Another of the issues I was thinking through was what happens at the end
of your scenario above:
- You're on timeline 1 and you need to perform recovery.
- You perform recovery and timeline 2 is created.
- You discover another error and decide to recover again.
- You recover timeline 1 again: what do you name the new timeline
created? 2 or 3? If you call it 2, you will be overwriting data, which is
exactly what timelines were invented to prevent, so that's got to be a bad
plan. If you think to avoid this by calling it 3, how do you know to do
that?

My imperfect solution to that was to use a randomised future timeline
number, reducing greatly the chance of ever conflicting on timeline
names. There's probably a solution to this used by other RDBMS, but I
don't know what it is and haven't gone looking, on the basis that it is
likely to be patented anyway...
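
A minimal sketch of the randomised idea, in Python. The upper bound
MAX_TIMELINE and the function name are assumptions of mine, chosen only to
make the example concrete.

```python
import random

MAX_TIMELINE = 0xFFFF  # assumed upper bound; not part of the proposal


def random_future_timeline(current, rng=None):
    """Return a random timeline number strictly greater than current."""
    if current >= MAX_TIMELINE:
        raise ValueError("no timeline numbers left above current")
    # SystemRandom avoids two restores seeded identically picking the same value.
    rng = rng or random.SystemRandom()
    return rng.randint(current + 1, MAX_TIMELINE)
```

Under this assumed range, two independent recoveries from timeline 1 pick
the same new number with probability 1 in 65534, so a conflict is unlikely
but not impossible, which is why I call the solution imperfect.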

[If you do this by adding a big number to the LogId, then 0002.0018 is
simply numerically larger than 0001.0019, so wouldn't ever be
considered.]
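
To illustrate the numeric comparison, here is a small Python sketch. The
stride of 0x10000 per timeline and the helper name are hypothetical; the
only point is that folding the timeline into the LogId gives one ordered
number.

```python
TIMELINE_STRIDE = 0x10000  # assumed offset added per timeline; hypothetical


def combined_log_id(name):
    """Map a 'TTTT.SSSS' example name to a single integer for comparison."""
    timeline, segment = name.split(".")
    return int(timeline, 16) * TIMELINE_STRIDE + int(segment, 16)
```

Here combined_log_id("0002.0018") is 0x20018 and
combined_log_id("0001.0019") is 0x10019, so every segment on the newer
timeline sorts after every segment on the older one.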

You then don't need the idea of a target timeline explicitly, and
therefore the user can't get wrong which timeline they're on or want to
be on.

Best regards, Simon Riggs
