Re: Proposal: Incremental Backup

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Gabriele Bartolini <gabriele(dot)bartolini(at)2ndquadrant(dot)it>, desmodemone <desmodemone(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Marco Nenciarini <marco(dot)nenciarini(at)2ndquadrant(dot)it>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Incremental Backup
Date: 2014-08-05 18:23:31
Message-ID: CA+U5nMKabKpkdxF+BAxwkEVnF8yXHPfrhHXpdseOMJQNvnRjtw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4 August 2014 19:30, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Mon, Aug 4, 2014 at 5:15 AM, Gabriele Bartolini
> <gabriele(dot)bartolini(at)2ndquadrant(dot)it> wrote:
>> I really like the proposal of working on a block level incremental
>> backup feature and the idea of considering LSN. However, I'd suggest
>> to see block level as a second step and a goal to keep in mind while
>> working on the first step. I believe that file-level incremental
>> backup will bring a lot of benefits to our community and users anyway.
>
> Thing is, I don't see how the LSN method is that much harder than an
> on-disk bitmap. In-memory bitmap IMO is just a recipe for disaster.
>
> Keeping a last-updated-LSN for each segment (or group of blocks) is
> just as easy as keeping a bitmap, and far more flexible and robust.
>
> The complexity and cost of safely keeping the map up-to-date is what's
> in question here, but as was pointed before, there's no really safe
> alternative. Nor modification times nor checksums (nor in-memory
> bitmaps IMV) are really safe enough for backups, so you really want to
> use something like the LSN. It's extra work, but opens up a world of
> possibilities.

OK, some comments on all of this.

* Wikipedia thinks the style of backup envisaged should be called "Incremental"
https://en.wikipedia.org/wiki/Differential_backup

* Base backups are worthless without WAL right up to the *last* LSN
seen during the backup, which is why pg_stop_backup() returns an LSN.
This is the LSN that is the effective viewpoint of the whole base
backup. So if we wish to get all changes since the last backup, we
must re-quote this LSN. (Or put another way - file level LSNs don't
make sense - we just need one LSN for the whole backup).

* When we take an incremental backup we need the WAL from the backup
start LSN through to the backup stop LSN. We do not need the WAL
between the last backup stop LSN and the new incremental start LSN.
That is a huge amount of WAL in many cases and we'd like to avoid
that, I would imagine. (So the space savings aren't just the delta
from the main data files, we should also look at WAL savings).

* For me, file based incremental is a useful and robust feature.
Block-level incremental is possible, but requires either significant
persistent metadata (1 MB per GB file) or access to the original
backup. One important objective here is to make sure we do NOT have to
re-read the last backup when taking the next backup; this helps us to
optimize the storage costs for backups. Plus, block-level recovery
requires us to have a program that correctly re-writes data into the
correct locations in a file, which seems likely to be a slow and bug
ridden process to me. Nice, safe, solid file-level incremental backup
first please. Fancy, bug prone, block-level stuff much later.

* One purpose of this could be to verify the backup. rsync provides a
checksum, pg_basebackup does not. However, checksums are frequently
prohibitively expensive, so perhaps asking for that is impractical and
maybe only a secondary objective.

* If we don't want/have file checksums, then we don't need a profile
file and using just the LSN seems fine. I don't think we should
specify that manually - the correct LSN is written to the backup_label
file in a base backup and we should read it back from there. We should
also write a backup_label file to incremental base backups, then we
can have additional lines saying what the source backups were. So full
base backup backup_labels remain as they are now, but we add one
additional line per increment, so we have the full set of increments,
much like a history file.

Normal backup_label files look like this

START WAL LOCATION: %X/%X
CHECKPOINT LOCATION: %X/%X
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: ....
LABEL: foo

so we would have a file that looks like this

START WAL LOCATION: %X/%X
CHECKPOINT LOCATION: %X/%X
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: ....
LABEL: foo
INCREMENTAL 1
START WAL LOCATION: %X/%X
CHECKPOINT LOCATION: %X/%X
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: ....
LABEL: foo incremental 1
INCREMENTAL 2
START WAL LOCATION: %X/%X
CHECKPOINT LOCATION: %X/%X
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: ....
LABEL: foo incremental 2
... etc ...

which we interpret as showing the original base backup, then the first
increment, then the second increment etc.. which allows us to recover
the backups in the correct sequence.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-08-05 19:03:01 Re: B-Tree support function number 3 (strxfrm() optimization)
Previous Message Jerry Sievers 2014-08-05 18:12:10 Append to a GUC parameter ?