Why we really need timelines *now* in PITR

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Why we really need timelines *now* in PITR
Date: 2004-07-17 20:36:09
Message-ID: 16709.1090096569@sss.pgh.pa.us
Lists: pgsql-hackers

If we do not add timeline numbers to WAL file names, we will be forced
to destroy information during recovery. Consider the following
scenario:

1. You have a WAL directory containing, say, WAL segments 0010 to 0020
(for the purposes of this example I won't bother typing out realistic
16-digit filenames, but just use 4-digit names).

2. You discover that your junior DBA messed up badly and you need to
revert to yesterday evening's state. Let's say the chosen recovery end
time is in the middle of file 0014.

3. You run the recovery process. At its end, the WAL end pointer will
be 0014 and some offset.

If we simply run forward from this situation, then we will be
overwriting existing WAL records in the existing files 0014-0020.
This is bad from the point of view of not wanting to discard information
(what if we decide we should have recovered to a later time?), but
there is an even more serious reason for not doing that. Suppose we
suffer a crash sometime after recovery. On restart, the system will
start replaying the logs, and *there will be nothing to keep it from
replaying all the way to the end of file 0020*. (The files will contain
proper, in-sequence page headers, so the tests that normally detect
recycled log segments won't think there is anything wrong.) This will
leave you with a thoroughly corrupt database.

One way to solve this would be to physically discard 0015-0020 as soon
as we decide we're stopping short of the end of WAL. I think that is
unacceptable on don't-throw-away-information grounds. I think it would
be far better to invent the timeline concept. Then, our old WAL files
would be named say 0001.0010 through 0001.0020, and we would start
logging into 0002.0014 after recovery.
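
A minimal sketch of the naming scheme described above, using the same
simplified 4-digit names as the example. The helper names are
hypothetical, the segment numbers are treated as plain decimal integers
for illustration, and picking "old timeline + 1" as the new timeline
number is an assumption; the message does not say how the new number is
chosen.

```python
def segment_name(timeline: int, segno: int) -> str:
    """Build a simplified 'TTTT.SSSS' WAL file name (illustrative format)."""
    return f"{timeline:04d}.{segno:04d}"


def start_new_timeline(existing: set, old_timeline: int, end_segno: int) -> str:
    """Return the first file name to log into after a recovery that
    stopped inside segment `end_segno` of `old_timeline`.

    Assumption: the new timeline number is simply old_timeline + 1.
    """
    new_name = segment_name(old_timeline + 1, end_segno)
    if new_name in existing:
        # Never reuse a name: the point of timelines is to avoid
        # overwriting WAL that already exists.
        raise FileExistsError(new_name)
    return new_name


# Old WAL files 0001.0010 through 0001.0020 stay untouched.
old_files = {segment_name(1, n) for n in range(10, 21)}
first_new_file = start_new_timeline(old_files, old_timeline=1, end_segno=14)
```

With the files from the example, `first_new_file` is `0002.0014`, and
none of the names in `old_files` is reused.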

A slightly tricky point is that we have to "sew together" the end of one
timeline and the start of the next --- for instance, we only want the
front part of 0001.0014, not the back part, to be part of the new
timeline. Patrick Macdonald told me about a pretty baroque scheme that
DB2 uses for this, but I think it would be simplest if we just copied
the appropriate amount of data from 0001.0014 into 0002.0014 and then
ran forward from there. Copying a max of 16MB of data doesn't sound
very onerous.
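
The copy step could look roughly like the sketch below. It is only an
illustration under stated assumptions: the function name and arguments
are hypothetical, `end_offset` stands for the byte offset of the
recovery end point within the old segment, and what should follow the
copied prefix in the new file (padding, preallocation) is not addressed
here.

```python
def copy_segment_prefix(src_path, dst_path, end_offset, chunk_size=1 << 20):
    """Copy the first `end_offset` bytes of the old-timeline segment
    into a newly created new-timeline segment.

    The destination is opened in exclusive-create mode ("xb"), so an
    existing file is never overwritten.
    """
    if end_offset < 0:
        raise ValueError("end_offset must be non-negative")
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "xb") as dst:
        while copied < end_offset:
            buf = src.read(min(chunk_size, end_offset - copied))
            if not buf:
                raise ValueError("source segment is shorter than end_offset")
            dst.write(buf)
            copied += len(buf)
    return copied
```

For the example in this message, the call would be something like
`copy_segment_prefix("0001.0014", "0002.0014", end_offset)`, after which
logging continues in `0002.0014` from `end_offset` onward.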

During WAL replay or recovery, there would be a notion of the "target
timeline", the timeline containing the point you are trying to recover
to. The rule for selecting which WAL segment file to read is "use the
one with the largest timeline number less than or equal to the target,
and never less than the timeline number you used for the previous
segment". So for
example if we realized we'd chosen the wrong recovery target time, we
could backpedal and redo the same recovery process with target timeline
0001, ignoring any WAL segments that had been archived with timeline
0002. Alternatively, if we were simply doing crash recovery in timeline
0002, we could stop at (say) segment 0002.0018, and we'd know that we
should ignore 0001.0019 because it is not in our timeline.
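
The selection rule can be sketched as follows. This is a toy model, not
an implementation: available WAL files are represented as
`(timeline, segment)` pairs, the function names are hypothetical, and
`start_timeline` (the timeline in effect before the first segment is
read) is an assumed parameter.

```python
def choose_timeline(available, segno, target_timeline, previous_timeline):
    """Pick the timeline to read segment `segno` from: the largest
    timeline <= target that is not less than the one used for the
    previous segment. Returns None if no file qualifies."""
    candidates = [
        tl
        for (tl, seg) in available
        if seg == segno and previous_timeline <= tl <= target_timeline
    ]
    return max(candidates, default=None)


def replay_order(available, start_segno, target_timeline, start_timeline=1):
    """List the (timeline, segment) files that replay would read, in order."""
    order = []
    current = start_timeline
    segno = start_segno
    while True:
        tl = choose_timeline(available, segno, target_timeline, current)
        if tl is None:
            break  # no usable file for this segment: end of WAL
        order.append((tl, segno))
        current = tl
        segno += 1
    return order


# Files from the example: timeline 1 has 0010-0020, timeline 2 has 0014-0018.
wal_files = {(1, n) for n in range(10, 21)} | {(2, n) for n in range(14, 19)}
to_timeline_2 = replay_order(wal_files, start_segno=10, target_timeline=2)
to_timeline_1 = replay_order(wal_files, start_segno=10, target_timeline=1)
```

With target timeline 2, replay reads 0001.0010 through 0001.0013, then
0002.0014 through 0002.0018, and stops there; 0001.0019 is ignored
because its timeline is lower than the one already in use. With target
timeline 1, replay reads 0001.0010 through 0001.0020 and never touches
the timeline 2 files.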

regards, tom lane
