Re: Proposals for PITR

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: simon(at)2ndquadrant(dot)com
Cc: pgsql-hackers-pitr(at)postgresql(dot)org
Subject: Re: Proposals for PITR
Date: 2004-02-15 18:13:04
Message-ID: 9293.1076868784@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-hackers-pitr

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> The general PITR recovery scenario requires:
> A - Take a full database backup
> B - Take regular log file backups
> C - Restore a full backup of a database.

Note that pg_dump has absolutely zero to do with this. The base for a
PITR recovery scenario has to be a physical dump, not a logical dump;
else WAL replay won't work.

The conclusion that we came to in previous discussions was that the
physical dump can be taken simply by doing "tar" (or your favorite
alternative tool) on the $PGDATA directory tree --- you could optionally
exclude the pg_xlog directory but must include all else. It is okay to
do this while the database is live, so long as the series of log files
you will use to recover starts from a checkpoint that occurred before
you began the tar run. When you do this the contents of the tar archive
will be an inconsistent snapshot of the database and therefore not
directly useful. However, if you load the tar archive and then replay
WAL from a prior checkpoint to any time after completion of the tar run,
you will have brought the database back to consistency.

Part A of the problem therefore only requires tar plus enough management
software to keep track of the start and stop times of the tar run and
correlate the start time to the WAL log series. It is worth pointing
out though that you need to start up your Part B solution (archiving off
WAL files as they come free) *before* you start taking your base dump.

(When we get around to implementing tablespaces, things might get more
complicated, since you'd need to remember to archive tablespace trees
that might not live under $PGDATA. But we can ignore that refinement
for now, I think.)

> We should assume that the system onto which the restore takes place is
> completely different from the system on which the backup was taken.

It can't be too different, since you will need a binary-compatible
server; you won't be able to move across machine architectures this way.

> - Add entries to WAL files so it contains enough information for
> recovery after a full backup
> This is assumed to be complete. [I'm not sure where the role of pg_clog
> is in all of this, so this may be a broken assumption.] Comments?

I believe that the major problems (lack of recording of file creation/
deletion) were solved by the J.R. Nield patches I applied last week.
There may be some minor issues left to fix but I can't think of any
showstoppers.

> - Write application to archive WAL files to tape, disk, or network
> Probably need to do first part, but I'm arguing not to do the copy to
> tape..

I'd like to somehow see this handled by a user-supplied program or
script. What we mainly need is to define a good API that lets the
archiver program understand which WAL segment files to archive when.

> B - Backing up WAL log files
> -Ordinarily, when old log segment files are no longer needed, they are
> recycled (renamed to become the next segments in the numbered sequence).
> This means that the data within them must be copied from there to
> another location
> AFTER postgres has closed that file
> BEFORE it is renamed and recycled

My inclination would be to change the backend code so that as soon as a
WAL segment is completed, it is flagged as being ready to dump to tape
(or wherever). Possibly the easiest way to do this is to rename the
segment file somehow, perhaps "nnn" becomes "nnn.full". Then, after the
archiver process has properly dumped the file, reflag it as being dumped
(perhaps rename to "nnn.done"). Obviously there are any number of ways
we could do this flagging, and depending on an OS rename facility might
not be the best.

A segment then can be recycled when it is both (a) older than the latest
checkpoint and (b) flagged as dumped. Note that this approach allows
dumping of a file to start before the first time at which it could be
recycled. In the event of a crash and restart, WAL replay has to be
able to find the flagged segments, so the flagging mechanism can't be
one that would make this impossible.

> Think about
> -what will happen if postgres tries to reuse file while we are still
> copying.

This is a non-problem; segments that haven't been recycled can't become
reuse targets.

> -what will happen if copy fails?

This is the archiver's problem to deal with. It only gets to be a
serious problem when you run out of disk space for WAL segment files,
so in most scenarios there is time for manual intervention to fix any
such problem and restart the archiver.

> -Manual indicates that current WAL format is bulky and would require
> some compressed format to be implemented. Initially, I suggest ignoring
> this and simply relying of OS or hardware/tape compression methods.

This is definitely something we could leave for later.

> With full OS file backup, if the database is shutdown correctly, then we
> will need a way to tell the database "you think you're up to date, but
> you're not - I've added some more WAL files into the directories, so
> roll forward on those now please".

I do not think this is an issue either, because my vision of this does
not include tar backups of shutdown databases. What will be backed up
is a live database, therefore the postmaster will definitely know that
it needs to perform WAL replay. What we will need is hooks to make sure
that the full set of required log files is available. It's entirely
possible that that set of log files exceeds available disk space, so it
needs to be possible to run WAL replay incrementally, loading and then
replaying additional log segments after deleting old ones.

Possibly we could do this with some postmaster command-line switches.
J. R. Nield's patch embodied an "interactive recovery" backend mode,
which I didn't like in detail but the general idea is not necessarily
wrong.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2004-02-15 20:55:36 Re: [PATCHES] dollar quoting
Previous Message Tom Lane 2004-02-15 17:22:23 Re: Proposed Query Planner TODO items

Browse pgsql-hackers-pitr by date

  From Date Subject
Next Message Simon Riggs 2004-02-16 23:07:25 Re: Proposals for PITR
Previous Message Marc G. Fournier 2004-02-13 04:37:32 Re: [HACKERS] PITR Dead horse?