Re: block-level incremental backup

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, Jeevan Ladhe <jeevan(dot)ladhe(at)enterprisedb(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: block-level incremental backup
Date: 2019-07-31 20:03:01
Message-ID: CA+Tgmoaj-zw4Mou4YBcJSkHmQM+JA-dAVJnRP8zSASP1S4ZVgw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 31, 2019 at 1:59 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> I feel Robert's suggestion is good.
> We can probably keep one meta file for each backup with some basic information
> of all the files being backed up, this metadata file will be useful in the
> below case:
> Table dropped before incremental backup
> Table truncated and Insert/Update/Delete operations before incremental backup

There's really no need for this with the design I proposed. The files
that should exist when you restore an incremental backup are exactly
the set of files that exist in the final incremental backup, except
that any .partial files need to be replaced with a correct
reconstruction of the underlying file. You don't need to know what
got dropped or truncated; you only need to know what's supposed to be
there at the end.

You may be thinking, as I once did, that restoring an incremental
backup would consist of restoring the full backup first and then
layering the incrementals over it, but if you read what I proposed, it
actually works the other way around: you restore the files that are
present in the incremental, and as needed, pull pieces of them from
earlier incremental and/or full backups. I think this is a *much*
better design than doing it the other way; it avoids any risk of
getting the wrong answer due to truncations or drops, and it also is
faster, because you only read older backups to the extent that you
actually need their contents.
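
As a rough illustration of that restore order, here is a minimal Python
sketch. The FullFile and PartialFile types, the in-memory layout, and the
function names are all hypothetical, invented for this example; the real
on-disk format of .partial files is not specified in this message. The
sketch only shows the direction of the lookup: start from the newest
backup and reach back into older ones for blocks that are still missing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class FullFile:
    """Hypothetical: a file stored whole, one entry per block."""
    blocks: List[bytes]


@dataclass
class PartialFile:
    """Hypothetical stand-in for a .partial file."""
    nblocks: int               # file length, in blocks, at backup time
    changed: Dict[int, bytes]  # only the blocks modified since the prior backup


Backup = Dict[str, Union[FullFile, PartialFile]]


def reconstruct(name: str, chain: List[Backup]) -> List[bytes]:
    """Rebuild one file. chain[0] is the newest incremental backup and
    chain[-1] is the full backup."""
    newest = chain[0][name]
    if isinstance(newest, FullFile):
        return list(newest.blocks)

    # The newest backup alone decides how long the file is, so an
    # earlier truncation cannot leak stale trailing blocks into the result.
    result: List[Optional[bytes]] = [None] * newest.nblocks
    missing = set(range(newest.nblocks))

    for backup in chain:  # newest first, reading older backups only as needed
        if not missing:
            break
        entry = backup.get(name)
        if entry is None:
            break  # the file did not exist in this older backup
        if isinstance(entry, FullFile):
            for n in sorted(missing):
                if n < len(entry.blocks):
                    result[n] = entry.blocks[n]
                    missing.discard(n)
            break  # a whole copy ends the search
        for n in sorted(missing):
            if n in entry.changed:
                result[n] = entry.changed[n]
                missing.discard(n)

    if missing:
        raise ValueError(f"{name}: blocks {sorted(missing)} not found in any backup")
    return [b for b in result if b is not None]


def restore(chain: List[Backup]) -> Dict[str, List[bytes]]:
    """Restore exactly the files named in the newest backup."""
    return {name: reconstruct(name, chain) for name in chain[0]}
```

Because restore() iterates only over the names in chain[0], a file that
was dropped before the final incremental never shows up in the output,
and no separate record of drops or truncations is consulted.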

I think it's a good idea to try to keep all the information about a
single file being backed up in one place. It's just less confusing. If,
for example, you have a metadata file that tells you which files are
dropped - that is, which files you DON'T have - then what happens if
one of those files is present in the data directory after all? Well,
then you have inconsistent information and are confused, and maybe
your code won't even notice the inconsistency. Similarly, if the
metadata file is separate from the block data, then what happens if
one file is missing, or isn't from the same backup as the other file?
That shouldn't happen, of course, but if it does, you'll get confused.
There's no perfect solution to these kinds of problems: if we suppose
that the backup can be corrupted by having missing or extra files, why
not also corruption within a single file? Still, on balance I tend to
think that keeping related stuff together minimizes the surface area
for bugs. I realize that's arguable, though.

One consideration that goes the other way: if you have a manifest file
that says what files are supposed to be present in the backup, then
you can detect a disappearing file, which is impossible with the
design I've proposed (and with the current full backup machinery).
That might be worth fixing, but it's a separate feature that has
little to do with incremental backup.
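
For what such a check could look like, here is a minimal Python sketch.
It assumes a manifest that is nothing more than a list of relative file
names; that format and the function name are hypothetical, not something
proposed in this thread.

```python
from typing import Iterable, Set, Tuple


def compare_with_manifest(
    manifest: Iterable[str], present: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) file names.

    manifest: names the backup is supposed to contain (hypothetical format)
    present:  names actually found in the backup directory
    """
    expected = set(manifest)
    found = set(present)
    missing = expected - found      # listed in the manifest but gone
    unexpected = found - expected   # on disk but never listed
    return missing, unexpected
```

A non-empty first set is the "disappearing file" case, which a design
without a manifest has no way to notice.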

> Probably it can also help us to decide which work the worker needs to do
> if we are planning to backup in parallel.

I don't think we need a manifest file for parallel backup. One
process or thread can scan the directory tree, make a list of which
files are present, and then hand individual files off to other
processes or threads. In short, the directory listing serves as the
manifest.
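
A minimal Python sketch of that arrangement follows, under the
assumption that "backing up" a file is a plain copy. The function names
and the use of a thread pool are illustrative choices only, not a
description of how the server-side backup code works.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from typing import List


def list_files(root: str) -> List[str]:
    """Single scanner: walk the tree once and collect relative paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            paths.append(os.path.relpath(full_path, root))
    return sorted(paths)


def copy_one(src_root: str, dst_root: str, rel: str) -> str:
    """Worker: copy one file, creating its parent directory if needed."""
    dst = os.path.join(dst_root, rel)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copyfile(os.path.join(src_root, rel), dst)
    return rel


def parallel_backup(src_root: str, dst_root: str, workers: int = 4) -> List[str]:
    """Scan once, then hand individual files to a pool of workers."""
    files = list_files(src_root)  # the directory listing is the work list
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda rel: copy_one(src_root, dst_root, rel), files))
```

The work list exists only in memory for the duration of the backup, so
nothing extra has to be written into the backup itself.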

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
