From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: PITR Functional Design v2 for 7.5
Date: 2004-03-08 23:28:25
Message-ID: 004701c40565$0fb93bd0$f3bd87d9@LaptopDellXP

PITR Functional Design v2 for 7.5

Currently, PostgreSQL provides Crash Recovery but not yet full Point In
Time Recovery (PITR). The following document provides a design which
enhances
the existing robustness features to include full PITR. Since one of the
primary objectives for PITR is robustness, this design is provided in
advance of patches to allow those features and behaviours to be
subjected to the rigours of [HACKERS] before final coding is attempted.
We're really not that far now from making this work, hence the attention
on up-front planning.

Thanks for your comments, Best Regards, Simon Riggs, 2nd Quadrant

Review of current Crash Recovery

Crash recovery is catered for by the use of WAL logging, or xlogs. Xlogs
are written to disk immediately before a transaction is acknowledged as
committed. Xlogs contain REDO information sufficient to roll forward any
changes from a known starting position. The known starting position is
also recorded by keeping track of which transactions have completed in a
file structure known as the clog. Clogs are also written to disk as
transactions commit.

The changed data pages are not written immediately back to disk. They do
not need to be, because the entries in the xlog and clog taken together
are sufficient to recover from a crash. Every so often a full checkpoint
process is created that will perform a full synchronisation of changed
("dirty") data pages back to disk. When a checkpoint is complete it will
write the last transaction id to the xlog as a marker, and it will trim
the clog files to the last transaction id. The frequency of checkpoints
is controllable. Changed data pages are also written back to disk
between checkpoints by a process called the bg_writer (or "lazy"
writer), reducing the effect of checkpoints on busy workloads.

In crash recovery, the database files are presumed to be intact, but not
necessarily up to date. When the postmaster comes up again, it checks the
clog to discover what the last checkpointed transaction id was. Using
this, it scans through the available xlog files to the marker written by
the checkpoint at that time. The REDO entries that follow are then
reapplied to data pages as far as possible until the system is brought
to the best available point.

If the appropriate xlogs are not available, no recovery is possible.

Following initdb, there will be at least 1 xlog. As new data is written
to the xlog, new files will be allocated as required. As a result of
checkpointing, there will be a time when xlogs are no longer required
for crash recovery. At each checkpoint, if there is an xlog that is no
longer required, the oldest one will be recycled or removed. Xlogs will
be recycled back to the front of the queue, so that we do not need to
delete and create files constantly. A certain maximum number of files
will be kept as preallocated logs; this limit is controllable. When the
limit is reached, xlogs will be removed rather than being recycled. As a
result, the number of xlogs may vary considerably over time, but mostly
they will cycle around a roughly steady-state number of xlogs, and
therefore a predictably constant space utilisation.

If an xlog cannot be written because the space available is full, then
the transaction that depended upon the xlog write will not be able to
commit, nor will any subsequent transactions until the space situation
is resolved. Currently, this imposes a limit on the size of any
transaction based upon the available disk space in the pg_xlog
directory.

Xlogs are relatively high volume, clogs are relatively low volume. An
out-of-space condition on the clog is typically unlikely.

Failure analysis:
- If a transaction fails, no changes will be committed to the xlog and
the clog entry will show the transaction aborted.
- If a transaction succeeds, its changes are committed to the xlog and
the clog entry shows the transaction succeeded.
- If the xlog directory fills or is otherwise unwritable, a PANIC is
raised.
- If the clog directory fills or is otherwise unwritable, a PANIC is
raised.

Point in Time Recovery (PITR)

PITR features are designed to extend the existing Crash Recovery
features so that a recovery can take place in situations where a crash
recovery would not have been possible. These situations are:
- database objects have been dropped
- xlogs do not go back far enough in time to allow rollforward recovery
- the database files are not intact and need to be completely replaced
before rollforward

To do this, a full physical backup of the system is required as a
starting point. When tablespaces are available, it should be possible to
restore and recover individual tablespaces. In addition, xlogs will need
to be moved out of the normal xlog filesystem to an archive destination.

PITR Proposed Solution

The proposed solution is to allow the existing crash recovery detection
and rollforward logic to be utilised directly to perform PITR, which
should allow the minimum number of changes and additional code.

To allow this to occur, the full backup *must* occur while the database
is open or "hot". This backup must include all data and clogs (and any
tablespaces or logical links utilised). A continuous sequence of xlogs
must also be available, stretching from the last checkpoint prior to the
start of the backup through to whatever time is specified for the
"recovery point" or until the end of the xlogs.

The full PITR solution consists of a number of components:
1. xlog archival
2. recovery-to-point-in-time (RPIT)

1. Xlog Archival
There are a wide range of Backup and Recovery (BAR) products on the
market, both open source and commercially licensed programs that provide
facilities to perform full physical backups and individual file
archives. The best way to foster wide adoption of PostgreSQL is to allow
it to work in conjunction with any of these products. To this end, a
PostgreSQL archival API is specified that will allow both PostgreSQL and
an external archiving program to work together in a coordinated manner
to achieve the backup of the xlogs.

The archival API will need to be implemented directly into the
PostgreSQL server, though it will also require a reference
implementation of the API to allow it to be copied and more widely used.
The reference implementation is also required to allow the workings of
the API to be sufficiently well tested to allow its release into the
mainstream PostgreSQL code. These together require the following two
sub-components:
1.1 XLogArchive API
1.2 pg_arch: simple xlog archiving tool

1.1 XLogArchive API
1.1.1 XLogArchive Initiation

The API assumes that all xlogs produced by PostgreSQL will need to be
archived. This is a requirement, since any break in the sequence of
xlogs will render the total archive useless for restoring forward from
the last backup.

When PostgreSQL server starts, it will check the value of the parameter
wal_archive_policy and enable/disable archiving accordingly. This
parameter can only be changed at server start. (This is required because
the initial step of archiving each xlog is performed by the backend; if
this were changeable after boot, then it might be possible for an
individual backend to override the wal_archive_policy and choose not to
archive - which would then affect the whole system and all users, not
just the user making that choice). It is considered less desirable to
utilize a compiler directive, since the archival policy is an
operational/business decision for a particular database, not a developer
activity on the dbms executable.

It is not defined whether the external archiver starts before
PostgreSQL, or soon afterwards. Clearly, it is intended that the two
should work together at the direction of the administrator. This slight
lack of clarity is intended to allow for the situation where start-up is
invoked within an automated boot sequence, where sub-system start-up
order may not be guaranteed by the OS. It also allows for variation
between the start-up times for PostgreSQL and the archiver; the archiver
might be ready in an instant or require manual intervention, such as a
new tape being loaded.

There is no requirement for the archiver to halt when PostgreSQL shuts
down, though it may choose to do so, e.g. it may be desirable to have
one archiver operate for multiple postmasters simultaneously. The
archiver knows many things about PostgreSQL, including its data
directory, so it is easily able to read the PID file and monitor the
postmaster if it chooses to do so. No additions to the API are required
in that area.

As a result there is no "connection" concept between PostgreSQL and the
archiver, as there is in other client-server APIs (libpq, TCP/IP, JDBC,
etc.). So there is no connection and no disconnection. Similarly, there
is no environment set up/tear down at this time.

1.1.1 XLogArchive API specification

(PostgreSQL ->) XLogArchiveNotify(xlog)
(<- Archiver) XLogArchiveXlogs()
(<- Archiver) XLogArchiveComplete(xlog)
(PostgreSQL ->) XLogArchiveBusy(xlog) returns ARCHIVE_OK or BUSY

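As a thumbnail of how these four calls are meant to interlock, here is a
hypothetical in-memory mock in Python (the real implementation proposed
below is file-based, per section 1.1.3; the class and its state tracking
are purely illustrative):

```python
# Illustrative in-memory mock of the XLogArchive handshake described
# above.  The real proposal passes state through files in pg_rlog; this
# sketch only shows the intended call ordering and return values.

class XLogArchiveMock:
    def __init__(self):
        self.state = {}  # xlog name -> "full" | "busy" | "done"

    # (PostgreSQL ->) called by the backend that closed the xlog file
    def XLogArchiveNotify(self, xlog):
        self.state[xlog] = "full"
        return True

    # (<- Archiver) poll for one xlog awaiting archival, else None
    def XLogArchiveXlogs(self):
        for xlog, s in self.state.items():
            if s == "full":
                self.state[xlog] = "busy"  # archiver claims this xlog
                return xlog
        return None

    # (<- Archiver) the archiver reports its copy has finished
    def XLogArchiveComplete(self, xlog):
        if self.state.get(xlog) == "done":
            return "ALREADY_NOTIFIED"
        self.state[xlog] = "done"
        return "SUCCESS"

    # (PostgreSQL ->) the checkpoint asks whether the xlog may be recycled
    def XLogArchiveBusy(self, xlog):
        s = self.state.get(xlog)
        if s is None:
            return "SEVERE_ERROR"
        return "ARCHIVE_OK" if s == "done" else "BUSY"
```

A typical sequence: the backend notifies, the checkpoint sees BUSY until
the archiver has claimed the xlog and called XLogArchiveComplete, after
which the checkpoint sees ARCHIVE_OK and may recycle the file.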
When writing to the xlog switches to the next file, the older file will
be closed. At this point, the postgresql backend which caused the xlog
file switch will then call

XLogArchiveNotify(xlog) returns TRUE or FALSE.

TRUE indicates successful notification, though not necessarily receipt
of that notification by the archiver.
FALSE indicates unsuccessful notification. This is a PANIC condition,
since the situation should not occur, and yet the administrator has
requested that the archival process take place.

Since the call is made by a particular user's backend, it is important
that this call can be made in minimum time and is not dependent upon the
external archiver, i.e. the call is asynchronous. No two backends will
call this at exactly the same time, though it is possible that one call
will not have completed before another call executes. Should multiple
calls be in progress at the same time, they will be notifying that
separate xlogs are ready for archiving, so there is no reason to require
logical locks. The notify call should be written in such a way that
allows multiple calls to be active simultaneously, i.e. no critical
sections or single-threading.

The archiver initially starts in a wait loop, waking up regularly to
call

XLogArchiveXlogs() returns a single XLOG filename, or NULL

If an xlog file is waiting to be archived, then the archiver will
discover the name of the xlog by using this API call. If more than one
file is available to be archived, the others will be ignored for now. If
the archiver is multi-threaded, it need not wait until it has executed
XLogArchiveComplete before it executes XLogArchiveXlogs again.

The archiver can now use the name of the xlog retrieved to visit the
pg_xlog directory and copy that xlog away to a place that it considers
safe. When this occurs to its satisfaction, the archiver will call

XLogArchiveComplete(xlog) returns SUCCESS, ALREADY_NOTIFIED or
SEVERE_ERROR

SUCCESS indicates successful notification, though not necessarily
receipt of that notification by PostgreSQL.
ALREADY_NOTIFIED indicates an error: XLogArchiveComplete had already
been called for that xlog. This indicates to the archiver either that
multiple archivers are active, or that this archiver has already called
XLogArchiveComplete for that xlog, which it should not be doing twice.
SEVERE_ERROR indicates unsuccessful notification. The archiver is
expected to retry this operation a number of times to ensure that this
condition is certain, then raise a priority human alert to correct the
situation. Allowance must be made to retry this call again following
intervention.

This is an asynchronous call, so there is no expectation that postgresql
will immediately receive this notification. There is no assumption that
archive copying must be single-threaded, or that the archiver must copy
files in the order that they become available. It is presumed that the
archiver has been granted read-only access by the administrator; no
xlogs should be available for copy other than as a result of direct
security authorisation. No xlogs may be altered or deleted by the
archiver in any way. There is no assumption that archival is bounded in
time, though it is strongly desirable that the archiver make best
efforts to copy away files and then call XLogArchiveComplete as quickly
and as consistently as possible. Recognition is made that copying to
tape or across networks may have considerable time variances, caused by
physical tape media changes, bandwidth prioritisation, etc. If there are
any known planned or regular delays in this process, then the archiver
is strongly encouraged to implement a two-stage process: copy files to a
more consistently performing location, such as another directory on the
same system, before external archival occurs.

At the normal point when xlogs are considered for deletion, i.e. after a
checkpoint, the postgresql checkpoint process will call

XLogArchiveBusy(xlog) returns ARCHIVE_OK, BUSY or SEVERE_ERROR

ARCHIVE_OK indicates successful archival completion and that the xlog
may now be removed by postgresql.
BUSY indicates that the archiver has not yet called XLogArchiveComplete
and the xlog should not yet be removed. It is possible that calling this
function against a particular xlog may return BUSY many times in a row.
Once ARCHIVE_OK is returned for any xlog, there is no meaning attached
to calling it again for the same xlog: the status in that situation is
undefined.
SEVERE_ERROR indicates that no information about the archival status
could be obtained.
This is an asynchronous call, so postgresql will not wait for that xlog
to complete archiving. This call is currently not likely to be called
simultaneously because it is called by the checkpoint process.
XLogArchiveBusy should not itself remove xlogs from the pg_xlog
directory. The existing mechanisms for xlog removal/recycling will be
used so that PITR does not interfere with the existing crash recovery
facilities.

The archival API is designed to work with only one archiver per
postmaster. If there were more than one archiver acting independently of
one another, whichever called XLogArchiveComplete first for a particular
xlog would allow postgresql to remove/recycle the file. If multiple
archive copies of xlog files are required, a single archiver must
coordinate the production of those multiple copies.

1.1.2 XLogArchival API Failure analysis
Failure conditions are mostly noted above.
Some behavioural notes may also be helpful:
If it regularly takes longer to archive than it does to switch xlogs,
then there will be a build up of xlog data. Timing analysis:
- denote the time taken between postgres notifying that an xlog can now
be archived and the attempt to recycle that same xlog as Tc
(XLogArchiveNotify to successful XLogArchiveBusy)
- denote the time taken between the archiver receiving notification and
completing the archival as Ta (XLogArchiveNotify to
XLogArchiveComplete)
- denote the number of xlogs as Nx
- denote the capacity of the xlog filesystem, in terms of number of
xlogs, as Nf

If Ta > Tc then Nx will increase.
However, we expect that as Nx increases, Tc will also increase, given a
constant xlog write rate (very roughly the same as a constant
transaction rate). There should be a point at which Tc increases such
that Ta = Tc, at which time Nx should reach a constant limit or steady
state, Nc. If Nc < Nf then everything will be fine; if however Nc > Nf,
then we will get an out-of-space condition. (Put another way, there may
not be a steady state before we hit the out-of-space condition.)
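The argument above can be illustrated with a toy simulation (all rates
and the capacity here are invented numbers, not a model of real xlog
traffic):

```python
def xlog_backlog(ticks, produce_every, archive_every, capacity):
    """Toy model of xlog build-up: one xlog is produced every
    `produce_every` ticks and one is archived (and freed) every
    `archive_every` ticks.  Returns (final backlog, tick at which the
    backlog exceeded `capacity`, or None if it never did)."""
    backlog = 0
    for t in range(1, ticks + 1):
        if t % produce_every == 0:
            backlog += 1                  # xlog switch: Notify issued
        if backlog and t % archive_every == 0:
            backlog -= 1                  # archival completes: recycle
        if backlog > capacity:
            return backlog, t             # out-of-space condition
    return backlog, None
```

When archival keeps pace (archive_every <= produce_every) the backlog
stays bounded; when it is systematically slower, the backlog grows
roughly linearly until the invented capacity Nf is exceeded.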

The out-of-space condition could therefore occur in two ways:
1. there is a single delay during which the xlog filesystem fills
2. there could be a systematic delay which builds slowly until the xlog
filesystem fills
(1) is only preventable by the archival program, or by the processes by
which that program is operated/administered
(2) is possibly preventable by either:
i) keeping track of any delay and reporting it
ii) releasing a WARNING when XLogArchiveBusy returns BUSY when called
more than a certain number of times on any particular xlog, or returns
BUSY on the first call for multiple consecutive xlogs.
Solving (2) is somewhat complicated in that the postgresql checkpoint
process is spawned once per checkpoint, so it cannot maintain
context/state information between checkpoints. Another mechanism may be
possible, such as a shared memory area or a disk file that can be read
by subsequent checkpoint processes.

PITR will act identically to Crash recovery when it hits an out-of-space
condition on the xlog directory, since it is exactly the same code. The
behaviour is to operate in "Fail Safe" mode.

It is possible that an administrator may wish to choose to keep
PostgreSQL up and to begin dropping log files rather than eventually
crash. If that choice was made AND a full physical backup was not yet
available, then there is a window of risk during which if a catastrophic
failure occurred then some committed transactions would not be
recoverable. It is not considered appropriate for anybody other than the
administrator to make this choice and so an option is planned to allow
"Fail Operational" behaviour (i.e. dropping logs) to be added.

It is not certain at this time whether this scheme will work
successfully if the full backup spans multiple checkpoints. It is
expected, however, that this would work if individual tablespaces were
synchronised to different checkpoints.

1.1.3 XLogArchive API Implementation:
The API definition has been separated from the implementation. This
should allow a better implementation to be more easily applied in the
future, and/or specific customisation to take place for particular
ports.

The initial proposal is a simple scheme that uses file existence & file
extension to pass information between PostgreSQL and the archiver. This
would take place in a peer directory of pg_xlog and pg_clog which has
been named the pg_rlog directory ("r" as in the strong first syllable
"ar" in the English pronunciation of "archive").

The use of a separate directory allows control over the security and
behaviour of the archiver: the archiver never has cause to create/delete
any files in critical PostgreSQL directories even if security isn't
enforced correctly. Only PostgreSQL will ever delete old xlog data, by
recycling or removing xlogs.

XLogArchiveNotify(xlog) returns TRUE or FALSE.
Will write a file called <XLOG>.full to the pg_rlog directory, where
<XLOG> is a filename in the pattern currently used by PostgreSQL xlogs.
The file will contain <XLOG>, Date/Time info.
If correctly written, returns TRUE, else FALSE.

Archiver will scan pg_rlog directory. If it sees an rlog that shows as
.full, it will then rename the rlog entry to <XLOG>.busy and then it
will copy (away to the archive location) the xlog in the pg_xlog
directory that matches the name of the rlog entry.

XLogArchiveComplete(xlog) returns SUCCESS, ALREADY_NOTIFIED or
SEVERE_ERROR
When the copy is complete, the archiver will rename the rlog entry to
<XLOG>.done. If all is OK, this returns SUCCESS. If the rlog entry has
already been renamed to <XLOG>.done, then ALREADY_NOTIFIED is returned.

XLogArchiveBusy(xlog) returns ARCHIVE_OK, BUSY or SEVERE_ERROR
If <XLOG>.done exists, then ARCHIVE_OK is returned, which allows <XLOG>
to be recycled/removed. If <XLOG>.full or <XLOG>.busy still exists, then
BUSY is returned. If no rlog entry for <XLOG> is available, SEVERE_ERROR
is returned.
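Sketched in Python rather than the server's C, the rename protocol above
might look like the following (the function names and error handling are
assumptions; only the .full/.busy/.done transitions come from the
design):

```python
import os

# Sketch of the pg_rlog rename protocol.  `rlog_dir` stands in for the
# proposed pg_rlog directory; helper names are illustrative, not the
# actual PostgreSQL functions.

def xlog_archive_notify(rlog_dir, xlog):
    """PostgreSQL side: mark <XLOG> ready by creating <XLOG>.full."""
    try:
        with open(os.path.join(rlog_dir, xlog + ".full"), "w") as f:
            f.write(xlog + "\n")   # spec: file holds name plus date/time
        return True
    except OSError:
        return False

def archiver_claim(rlog_dir, xlog):
    """Archiver side: rename <XLOG>.full -> <XLOG>.busy before copying."""
    os.rename(os.path.join(rlog_dir, xlog + ".full"),
              os.path.join(rlog_dir, xlog + ".busy"))

def xlog_archive_complete(rlog_dir, xlog):
    """Archiver side: rename <XLOG>.busy -> <XLOG>.done after the copy."""
    if os.path.exists(os.path.join(rlog_dir, xlog + ".done")):
        return "ALREADY_NOTIFIED"
    os.rename(os.path.join(rlog_dir, xlog + ".busy"),
              os.path.join(rlog_dir, xlog + ".done"))
    return "SUCCESS"

def xlog_archive_busy(rlog_dir, xlog):
    """PostgreSQL side: may the checkpoint recycle <XLOG> yet?"""
    if os.path.exists(os.path.join(rlog_dir, xlog + ".done")):
        return "ARCHIVE_OK"
    if (os.path.exists(os.path.join(rlog_dir, xlog + ".busy"))
            or os.path.exists(os.path.join(rlog_dir, xlog + ".full"))):
        return "BUSY"
    return "SEVERE_ERROR"
```

One design point worth noting: because each state change is a single
rename within one directory, the protocol needs no locks, matching the
requirement above that notify calls avoid critical sections.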

Src/backend/utils/guc.c will be modified to add any required
configuration parameters.

Src/backend/access/transam/xlog.c will be modified to implement the
PostgreSQL-side calls: XLogArchiveNotify(xlog) and XLogArchiveBusy(xlog).

A C implementation of the archiver-side API calls, XLogArchiveXlogs()
and XLogArchiveComplete(xlog), will also be provided.

1.2 pg_arch: simple xlog archiving tool

Src/tools/ will add pg_arch: a single-threaded program that uses
libpgarch.c to use the API, but provides only a simple copy facility
from pg_xlog to another directory. The program will continue to wait and
watch for newly notified files: it is not a file-filter type of program.
It may be run as a foreground process (for testing etc), but is also
designed to be run as a background process, typically executed at the
same time as postmaster startup (through a mechanism such as a service
autostart mechanism following system boot).
pg_arch has two parameters:
-D data-file root for a particular instance of PostgreSQL
-A archive directory
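A minimal sketch of one polling pass of a pg_arch-style archiver,
assuming the pg_rlog rename scheme of section 1.1.3 (a Python stand-in
for the proposed C tool; the directory layout is as described, and
everything else is illustrative):

```python
import os
import shutil
import time

def pg_arch_once(data_dir, archive_dir):
    """One polling pass of a pg_arch-style archiver: find a .full rlog
    entry, claim it, copy the matching xlog to the archive directory,
    then mark it .done.  Returns the xlog name copied, or None if
    nothing was waiting."""
    rlog = os.path.join(data_dir, "pg_rlog")
    xlogdir = os.path.join(data_dir, "pg_xlog")
    for entry in sorted(os.listdir(rlog)):
        if not entry.endswith(".full"):
            continue
        xlog = entry[:-len(".full")]
        # claim: .full -> .busy (the XLogArchiveXlogs step)
        os.rename(os.path.join(rlog, entry),
                  os.path.join(rlog, xlog + ".busy"))
        # copy the xlog away; read-only access to pg_xlog is assumed
        shutil.copy2(os.path.join(xlogdir, xlog),
                     os.path.join(archive_dir, xlog))
        # complete: .busy -> .done (the XLogArchiveComplete step)
        os.rename(os.path.join(rlog, xlog + ".busy"),
                  os.path.join(rlog, xlog + ".done"))
        return xlog
    return None

def pg_arch_loop(data_dir, archive_dir, interval=1.0):
    """Wait-and-watch loop, as described for pg_arch above."""
    while True:
        if pg_arch_once(data_dir, archive_dir) is None:
            time.sleep(interval)
```

A two-stage archiver, as recommended earlier for tape or network
targets, would simply make `archive_dir` a local staging directory and
ship its contents onward separately.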

2. Recovery to Point-in-Time (RPIT)

Recovery to a point in time will offer these options:

2.1 Recovery to end of logs (last time)
2.2 Recovery of all available on-line logs
2.3 Point in time recovery to the checkpoint AT the time specified, or
to the last checkpoint before it.

The administrator is expected to be responsible for placing archived
xlogs back into the pg_xlog directory. This may be a facility provided
by the external archiver, a manual or other automated process. If any
mistakes are made at this point then the administrator can then reselect
appropriate xlogs and try again. There is no enforced limit to the
number of recovery attempts possible.

2.1 Recovery to end of logs
Default option requires no additional/changed PostgreSQL code. Archive
API will be tested using this option.

2.2 Recovery of all available on-line logs
This will be made available as a command-line switch on postmaster. This
will allow roll-forward on xlogs until all available logs are recovered,
then the postmaster will shut down.
This can be used in two ways:
- when the xlog archive exceeds available disk space: following
execution in this mode, the administrator would recover PostgreSQL in
batches. When the last batch is reached, the command switch would no
longer be used.

2.3 RPIT
Add a feature to accept a recovery target time parameter and to halt
recovery when that time is reached.
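The halt condition amounts to replaying xlog records in order and
stopping before the first one stamped later than the target; a toy
illustration (the record format here is invented):

```python
def replay_until(records, target_time):
    """Toy RPIT stop rule: apply records in order, halting before the
    first record whose commit timestamp exceeds the target.  Each
    record is a (commit_time, payload) pair; the format is invented
    for illustration only."""
    applied = []
    for commit_time, payload in records:
        if commit_time > target_time:
            break              # recovery point reached; stop replay
        applied.append(payload)
    return applied
```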

3. Possible future extensions
The implementation in 1.1.3 may well be improved upon, or it may also be
implemented differently altogether according to the architecture of the
archiving program.

Suggestions have been made to introduce a generalised notification
interface. If such were available, it would be straightforward to alter
the archival API to utilise this. It's outside of the aims of this
development to consider that further.

It is foreseen that the API would be able to be used to form the basis
of an XBSA or NDMP client application that could then work easily with
the existing enterprise storage management products.


