Re: Why we really need timelines *now* in PITR

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Why we really need timelines *now* in PITR
Date: 2004-07-19 18:33:26
Message-ID: 24170.1090262006@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I think there's really no way around the issue: somehow we've got to
> keep some meta-history outside the $PGDATA area, if we want to do this
> in a clean fashion.

After further thought I think we can fix this stuff by creating a
"history file" for each timeline. This will make recovery slightly more
complicated but I don't think it would be any material performance
problem. Here's how it goes:

* Timeline IDs are 32-bit ints with no particular semantic significance
(that is, we do not assume timeline 3 is a child of 2, or anything like
that). The actual parentage of a timeline has to be found by inspecting
its history file.

* History files will be named by their timeline ID, say "00000042.history".
They will be created in $PGDATA/pg_xlog whenever a new timeline is created
by the act of doing a recovery to a point in time earlier than the end
of existing WAL. When doing WAL archiving, a history file can be copied
off to the archive area by the existing archiver mechanism (i.e., we'll
make a .ready file for it as soon as it's written).

* History files will be plain text (for human consumption) and will
essentially consist of a list of parent timeline IDs in sequence.
I envision adding the timeline split timestamp and starting WAL segment
number too, but these are for documentation purposes --- the system
doesn't need them. We may as well allow comments in there as well,
so that the DBA can annotate the reasons for a PITR split to have been
done. So the contents might look like

    # Recover from unintentional TRUNCATE
    00000001 0000000A00142568 2005-05-16 12:34:15 EDT
    # Ex-assistant DBA dropped wrong table
    00000007 0000002200005434 2005-11-17 18:44:44 EST

When we split off a new timeline, we just have to copy the parent's
history file (which we can do verbatim including comments) and then
add a new line at the end showing the immediate parent's timeline ID
and the other details of the split. Initdb can create 00000001.history
with empty contents (since that timeline has no parents).
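
A minimal Python sketch of the history-file handling described above. It
assumes timeline IDs are written as eight hex digits and that only the first
field of each line matters to the system; the names parse_history and
child_history are illustrative, not existing PostgreSQL functions.

```python
def parse_history(text):
    """Return the parent timeline IDs listed in a history file, oldest first."""
    parents = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and DBA comments carry no data
        # Assumption: the ID is hex. The segment number and timestamp that
        # follow are documentation only, so they are ignored here.
        parents.append(int(line.split()[0], 16))
    return parents


def child_history(parent_text, parent_tli, split_segment, split_time, comment=None):
    """Build a new timeline's history: the parent's file verbatim plus one line."""
    out = parent_text
    if out and not out.endswith("\n"):
        out += "\n"
    if comment:
        out += "# " + comment + "\n"
    out += "%08X %s %s\n" % (parent_tli, split_segment, split_time)
    return out
```

Feeding the first two lines of the example above to child_history, with
parent timeline 7 and the second split's details, reproduces the four-line
file shown; an empty parent history yields a one-line child.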

* When we need to do recovery, we first identify the source timeline
(either by reading the current timeline ID from pg_control, or the DBA
can tell us with a parameter in recovery.conf). We then read the
history file for that timeline, and remember its sequence of parent
timeline IDs. We can crosscheck that pg_control's timeline ID is
one of this set of timeline IDs, too --- if it's not then the wrong
backup file was restored.
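
A rough sketch of this setup step in Python. The function plan_recovery and
its read_parents callback are hypothetical; it also assumes that pg_control's
timeline passes the crosscheck when it is the source timeline itself, not
only when it is one of the parents.

```python
def plan_recovery(control_tli, read_parents, requested_tli=None):
    """Pick the source timeline and crosscheck the restored backup against it.

    control_tli   -- timeline ID found in pg_control
    read_parents  -- callable returning the parent IDs from a history file
    requested_tli -- optional override, as a DBA might give in recovery.conf
    """
    source_tli = requested_tli if requested_tli is not None else control_tli
    parents = read_parents(source_tli)
    # The backup must belong to the source timeline or one of its ancestors.
    if control_tli != source_tli and control_tli not in parents:
        raise ValueError(
            "backup is on timeline %08X, which is not in the history of %08X"
            % (control_tli, source_tli)
        )
    return source_tli, parents
```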

* During recovery, whenever we need to open a WAL segment file, we first
try to open it with the source timeline ID; if that doesn't exist, try
the immediate parent timeline ID; then the grandparent, etc. Whenever
we find a WAL file with a particular timeline ID, we forget about all
parents further up in the history, and won't try to open their segments
anymore (this is the generalization of my previous rule that you never
drop down in timeline number as you scan forward).
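
The search rule can be sketched as follows. This is only an illustration:
the class name, the segment file naming (eight hex digits of timeline ID
followed by the segment number), and the injectable exists callback are all
assumptions made for the example.

```python
import os


def segment_name(tli, segno):
    # Assumed layout: timeline ID as 8 hex digits, then the segment number.
    return "%08X%s" % (tli, segno)


class SegmentLocator:
    def __init__(self, wal_dir, source_tli, parent_tlis, exists=os.path.exists):
        self.wal_dir = wal_dir
        self.exists = exists
        # History files list parents oldest first, so reverse them to get
        # source, immediate parent, grandparent, ...
        self.candidates = [source_tli] + list(reversed(parent_tlis))

    def locate(self, segno):
        """Return the path of the segment to open, or None if none exists."""
        for i, tli in enumerate(self.candidates):
            path = os.path.join(self.wal_dir, segment_name(tli, segno))
            if self.exists(path):
                # Forget every ancestor older than the timeline just found.
                del self.candidates[i + 1:]
                return path
        return None
```

Once a segment from timeline 7 has been opened, a later segment that exists
only under timeline 1 is never considered, even if it is still present in
the directory.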

* If we end recovery because we have rolled forward off the end of WAL,
we can just continue using the source timeline ID --- we are extending
that timeline. (Thus, an ordinary crash and restart doesn't require
generating a new timeline ID; nor do we generate a new one during
normal postmaster stop/start.) But if we stop recovery at a requested
point-in-time earlier than end of WAL, we have to branch off a new
timeline. We do this by:
    * Selecting a previously unused timeline ID (see below).
    * Writing a history file for this ID, by copying the parent
      timeline's history file and adding a new line at the end.
    * Copying the last-used WAL segment of the parent timeline,
      giving it the same segment number but the new timeline's ID.
      This becomes the active WAL segment when we start operating.
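
As a sketch only, the last two steps might look like this in Python, with
the caller supplying the unused ID chosen in the first step. The helper
names and the file naming scheme are assumptions for illustration; a missing
parent history file is treated as "no parents".

```python
import os
import shutil


def history_name(tli):
    return "%08X.history" % tli


def segment_name(tli, segno):
    # Assumed layout: timeline ID as 8 hex digits, then the segment number.
    return "%08X%s" % (tli, segno)


def branch_timeline(wal_dir, parent_tli, new_tli, last_segno, split_time, comment=None):
    """Create the history file and first WAL segment of a new timeline."""
    new_hist = os.path.join(wal_dir, history_name(new_tli))
    if os.path.exists(new_hist):
        raise ValueError("timeline %08X is already in use" % new_tli)

    # Copy the parent's history verbatim, then append a line for this split.
    parent_hist = os.path.join(wal_dir, history_name(parent_tli))
    text = ""
    if os.path.exists(parent_hist):
        with open(parent_hist) as f:
            text = f.read()
    if text and not text.endswith("\n"):
        text += "\n"
    if comment:
        text += "# %s\n" % comment
    text += "%08X %s %s\n" % (parent_tli, last_segno, split_time)
    with open(new_hist, "w") as f:
        f.write(text)

    # Same segment number, new timeline ID: this becomes the active segment.
    old_seg = os.path.join(wal_dir, segment_name(parent_tli, last_segno))
    new_seg = os.path.join(wal_dir, segment_name(new_tli, last_segno))
    shutil.copyfile(old_seg, new_seg)
    return new_seg
```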

* We can identify the highest timeline ID ever used by simply starting
with the source timeline ID and probing pg_xlog and the archive area
for history files N+1.history, N+2.history, etc until we find an ID
for which there is no history file. Under reasonable scenarios this
will not take very many probes, so it doesn't seem that we need any
addition to the archiver API to make it more efficient.
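
The probing loop is simple enough to sketch directly. Here history_exists is
a hypothetical callback that would check both pg_xlog and the archive area
for a given file name; the eight-hex-digit file name is likewise an
assumption.

```python
def next_unused_timeline(source_tli, history_exists):
    """Probe N+1, N+2, ... until some ID has no history file, and return it."""
    candidate = source_tli + 1
    while history_exists("%08X.history" % candidate):
        candidate += 1
    return candidate
```

The highest timeline ID ever used is then the returned value minus one, and
the returned value itself is the ID to give to a new branch.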

* Since history files will be small and made infrequently (one hopes you
do not need to do a PITR recovery very often...) I see no particular
reason not to leave them in $PGDATA/pg_xlog indefinitely. The DBA can clean
out old ones if she is a neatnik, but I don't think the system needs to
or should delete them. Similarly the archive area could be expected to
retain history files indefinitely.

* However, you *can* throw away a history file once you are no longer
interested in rolling back to times predating the splitoff point of the
timeline. If we don't find a history file we can just act as though the
timeline has no parents (extends indefinitely far in the past). (Hm,
so we don't actually have to bother creating 00000001.history...)

* I'm intending to replace the current concept of StartUpID (SUI) by
timeline IDs --- we'll record timeline IDs not SUIs in data page headers
and WAL page headers. SUI isn't doing anything of value for us; I think
it was probably intended to do what timelines will do, but it's not
defined quite right for the purpose. One good thing about timeline IDs
for WAL page headers is that we know exactly which IDs should be
expected in a WAL file (either the current timeline or one of its
parents); this allows a much tighter check than is possible with SUIs.

Anybody see any holes in this design?

regards, tom lane
