block-level incremental backup

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: block-level incremental backup
Date: 2019-04-09 15:48:38
Message-ID: CA+TgmoYxQLL=mVyN90HZgH0X_EUrw+aZ0xsXJk7XV3-3LygTvA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
developed technology that permits a block-level incremental backup to
be taken from a PostgreSQL server. I believe the idea in all of those
cases is that non-relation files should be backed up in their
entirety, but for relation files, only those blocks that have been
changed need to be backed up. I would like to propose that we should
have a solution for this problem in core, rather than leaving it to
each individual PostgreSQL company to develop and maintain their own
solution. Generally my idea is:

1. There should be a way to tell pg_basebackup to request from the
server only those blocks where LSN >= threshold_value. There are
several possible ways for the server to implement this, the simplest
of which is to just scan all the blocks and send only the ones that
satisfy that criterion. That might sound dumb, but it does still save
network bandwidth, and it works even without any prior setup. It will
probably be more efficient in many cases to instead scan all the WAL
generated since that LSN and extract block references from it, but
that is only possible if the server has all of that WAL available or
can somehow get it from the archive. We could also, as several people
have proposed previously, have some kind of additional relation fork
that stores either a single is-modified bit -- which only helps if the
reference LSN for the is-modified bit is older than the requested LSN
but not too much older -- or the highest LSN for each range of K
blocks, or something like that. I am at the moment not too concerned
with the exact strategy we use here. I believe we may want to
eventually support more than one, since they have different
trade-offs. (A standalone sketch of the simplest scan-every-block
approach appears just after this list.)

2. When you use pg_basebackup in this way, each relation file that is
not sent in its entirety is replaced by a file with a different name.
For example, instead of base/16384/16417, you might get
base/16384/partial.16417 or however we decide to name them. Each such
file will store near the beginning of the file a list of all the
blocks contained in that file, and the blocks themselves will follow
at offsets that can be predicted from the metadata at the beginning of
the file. The idea is that you shouldn't have to read the whole file
to figure out which blocks it contains, and if you know specifically
which blocks you want, you should be able to read just those blocks
reasonably efficiently. (One possible layout along these lines is
sketched just after this list.) A backup taken in this manner should
also
probably create some kind of metadata file in the root directory that
stops the server from starting and lists other salient details of the
backup. In particular, you need the threshold LSN for the backup
(i.e. it only contains blocks whose LSN is at or after this one) and
the start LSN for the
backup (i.e. the LSN that would have been returned from
pg_start_backup).

3. There should be a new tool that knows how to merge a full backup
with any number of incremental backups and produce a complete data
directory with no remaining partial files. The tool should check that
the threshold LSN for each incremental backup is less than or equal to
the start LSN of the previous backup; if not, there may be changes
that happened in between which would be lost, so combining the backups
is unsafe. Running this tool can be thought of either as restoring
the backup or as producing a new synthetic backup from any number of
incremental backups. This would allow for a strategy of unending
incremental backups. For instance, on day 1, you take a full backup.
On every subsequent day, you take an incremental backup. On day 9,
you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full
day2. On each subsequent day you do something similar. Now you can
always roll back to any of the last seven days by combining the oldest
backup you have (which is always a synthetic full backup) with as many
newer incrementals as you want, up to the point where you want to
stop.
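
To make point 1 above a little more concrete, here's a rough
standalone sketch of the simplest scan-every-block strategy. This is
not server code and not a proposed implementation; it assumes a plain
relation segment with the default 8kB block size, a threshold given as
a raw 64-bit integer rather than the usual X/Y notation, and a file
read on the same architecture that wrote it. The only real fact it
relies on is that the first 8 bytes of every page hold the page LSN.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/*
 * The page LSN lives in the first 8 bytes of every page, stored as two
 * 32-bit halves (high word first) in the byte order of the machine that
 * wrote the page.
 */
static uint64_t
page_lsn(const unsigned char *page)
{
    uint32_t    hi;
    uint32_t    lo;

    memcpy(&hi, page, sizeof(hi));
    memcpy(&lo, page + sizeof(hi), sizeof(lo));
    return ((uint64_t) hi << 32) | lo;
}

int
main(int argc, char **argv)
{
    unsigned char page[BLCKSZ];
    uint64_t    threshold;
    uint64_t    blkno = 0;
    FILE       *fp;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s relation-file threshold-lsn\n", argv[0]);
        return 1;
    }
    threshold = strtoull(argv[2], NULL, 0);     /* decimal or 0x... */

    fp = fopen(argv[1], "rb");
    if (fp == NULL)
    {
        perror(argv[1]);
        return 1;
    }

    /* Scan every block; report only those that satisfy the criterion. */
    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (page_lsn(page) >= threshold)
            printf("block %llu has page LSN >= threshold\n",
                   (unsigned long long) blkno);
        blkno++;
    }
    fclose(fp);
    return 0;
}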
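
Similarly, for point 2, here is one hypothetical way the partial-file
layout could work. The struct name and fields are made up for
illustration, not a proposed format; the point is just that with the
block list at a known place near the start of the file, the offset of
any stored block can be computed without reading the rest of the file.

#include <stdint.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Hypothetical header for a "partial" relation file. */
typedef struct PartialFileHeader
{
    uint32_t    magic;          /* identifies the file format */
    uint32_t    nblocks;        /* number of blocks stored in this file */
    /* followed by nblocks sorted uint32_t block numbers, then the blocks */
} PartialFileHeader;

/*
 * Offset of the i'th stored block (0 <= i < nblocks).  Locating a given
 * block number is a binary search of the sorted block-number array plus
 * one seek to this offset; nothing else in the file needs to be read.
 */
static off_t
partial_block_offset(uint32_t nblocks, uint32_t i)
{
    return (off_t) sizeof(PartialFileHeader)
        + (off_t) nblocks * sizeof(uint32_t)
        + (off_t) i * BLCKSZ;
}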

Other random points:
- If the server has multiple ways of finding blocks with an LSN
greater than or equal to the threshold LSN, it could make a cost-based
decision between those methods, or it could allow the client to
specify the method to be used.
- I imagine that the server would offer this functionality through a
new replication command or a syntax extension to an existing command,
so it could also be used by tools other than pg_basebackup if they
wished.
- Combining backups could also be done destructively rather than, as
proposed above, non-destructively, but you have to be careful about
what happens in case of a failure.
- The pg_combinebackup tool (or whatever we call it) should probably
have an option to exploit hard links to save disk space; this could in
particular make construction of a new synthetic full backup much
cheaper. However, you'd better be careful not to use this option when
actually trying to restore, because if you start the server and run
recovery, you don't want to change the copies of those same files that
are in your backup directory. I guess the server could be taught to
complain about st_nlink > 1, but I'm not sure we want to go there. (A
minimal sketch of such a check appears after these points.)
- It would also be possible to collapse multiple incremental backups
into a single incremental backup, without combining with a full
backup. In the worst case, size(i1+i2) = size(i1) + size(i2), but if
the same data is modified repeatedly, collapsing backups would save
lots of space. This doesn't seem like a must-have for v1, though.
- If you have a SAN and are taking backups using filesystem snapshots,
then you don't need this, because your SAN probably already uses
copy-on-write magic for those snapshots, and so you are already
getting all of the same benefits in terms of saving storage space that
you would get from something like this. But not everybody has a SAN.
- I know that there have been several previous efforts in this area,
but none of them have gotten to the point of being committed. I
intend no disrespect to those efforts. I believe I'm taking a
slightly different view of the problem here than what has been done
previously, trying to focus on the user experience rather than, e.g.,
the technology that is used to decide which blocks need to be sent.
However it's possible I've missed a promising patch that takes an
approach very similar to what I'm outlining here, and if so, I don't
mind a bit having that pointed out to me.
- This is just a design proposal at this point; there is no code. If
this proposal, or some modified version of it, seems likely to be
acceptable, I and/or my colleagues might try to implement it.
- It would also be nice to support *parallel* backup, both for full
backups as we can do them today and for incremental backups. But that
sounds like a separate effort. pg_combinebackup could potentially
support parallel operation as well, although that might be too
ambitious for v1.
- It would also be nice if pg_basebackup could write backups to places
other than the local disk, like an object store, a tape drive, etc.
But that also sounds like a separate effort.
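
As a footnote to the hard-link point above, the st_nlink check itself
would be trivial -- something along these lines (just a sketch using
plain stat(); where such a check would live is the real question):

#include <sys/types.h>
#include <sys/stat.h>

/*
 * Returns 1 if the file has more than one hard link (so rewriting it in
 * place would also rewrite a copy that lives elsewhere, e.g. in a backup
 * directory), 0 if it does not, and -1 on error.
 */
static int
file_is_hardlinked(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return st.st_nlink > 1 ? 1 : 0;
}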

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
