Re: block-level incremental backup

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: block-level incremental backup
Date: 2019-04-22 17:44:25
Message-ID: CA+TgmoYDDddjknLegF6kFJbPXvtJAHjjS+qagfLUAfdyvuk62A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > I think we're getting closer to a meeting of the minds here, but I
> > don't think it's intrinsically necessary to rewrite the whole method
> > of operation of pg_basebackup to implement incremental backup in a
> > sensible way.
>
> It wasn't my intent to imply that the whole method of operation of
> pg_basebackup would have to change for this.

Cool.

> > One could instead just do a straightforward extension
> > to the existing BASE_BACKUP command to enable incremental backup.
>
> Ok, how do you envision that? As I mentioned up-thread, I am concerned
> that we're talking too high-level here and it's making the discussion
> more difficult than it would be if we were to put together specific
> ideas and then discuss them.
>
> One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> optional parameter and then having the database server scan the entire
> cluster and send a tarball which contains essentially a 'diff' file of
> some kind for each file where we can construct a diff based on the LSN,
> and then the complete contents of the file for everything else that
> needs to be in the backup.

/me scratches head. Isn't that pretty much what I described in my
original post? I even described what that "'diff' file of some kind"
would look like in some detail in the paragraph of that emailed
numbered "2.", and I described the reasons for that choice at length
in http://postgr.es/m/CA+TgmoZrqdV-tB8nY9P+1pQLqKXp5f1afghuoHh5QT6ewdkJ6g@mail.gmail.com

I can't figure out how I'm managing to be so unclear about things
about which I thought I'd been rather explicit.

> So, sure, that would work, but it wouldn't be able to be parallelized
> and I don't think it'd end up being very exciting for the external tools
> because of that, but it would be fine for pg_basebackup.

Stop being such a pessimist. Yes, if we only add the option to the
BASE_BACKUP command, it won't directly be very exciting for external
tools, but a lot of the work that is needed to do things that ARE
exciting for external tools will have been done. For instance, if the
work to figure out which blocks have been modified via WAL-scanning
gets done, and initially that's only exposed via BASE_BACKUP, it won't
be much work for somebody to write code for a new code that exposes
that information directly through some new replication command.
There's a difference between something that's going in the wrong
direction and something that's going in the right direction but not as
far or as fast as you'd like. And I'm 99% sure that everything I'm
proposing here falls in the latter category rather than the former.

> On the other hand, if you added new commands for 'list of files changed
> since this LSN' and 'give me this file' and 'give me this file with the
> changes in it since this LSN', then pg_basebackup could work with that
> pretty easily in a single-threaded model (maybe with two connections to
> the backend, but still in a single process, or maybe just by slurping up
> the file list and then asking for each one) and the external tools could
> leverage those new capabilities too for their backups, both full backups
> and incremental ones. This also wouldn't have to change how
> pg_basebackup does full backups today one bit, so what we're really
> talking about here is the direction to take the new code that's being
> written, not about rewriting existing code. I agree that it'd be a bit
> more work... but hopefully not *that* much more, and it would mean we
> could later add parallel backup to pg_basebackup more easily too, if we
> wanted to.

For purposes of implementing parallel pg_basebackup, it would probably
be better if the server rather than the client decided which files to
send via which connection. If the client decides, then every time the
server finishes sending a file, the client has to request another
file, and that introduces some latency: after the server finishes
sending each file, it has to wait for the client to finish receiving
the data, and it has to wait for the client to tell it what file to
send next. If the server decides, then it can just send data at top
speed without a break. So the ideal interface for pg_basebackup would
really be something like:

START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;

...returning a cookie that can be then be used by each participant for
an argument to a new commands:

JOIN_PARALLLEL_BACKUP 'cookie';

However, that is obviously extremely inconvenient for third-party
tools. It's possible we need both an interface like this -- for use
by parallel pg_basebackup -- and a
START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
interface for use by external tools. On the other hand, maybe the
additional overhead caused by managing the list of files to be fetched
on the client side is negligible. It'd be interesting to see, though,
how busy the server is when running an incremental backup managed by
an external tool like BART or pgbackrest on a cluster with a gazillion
little-tiny relations. I wonder if we'd find that it spends most of
its time waiting for the client.

> What I'm afraid will be lackluster is adding block-level incremental
> backup support to pg_basebackup without any support for managing
> backups or anything else. I'm also concerned that it's going to mean
> that people who want to use incremental backup with pg_basebackup are
> going to have to write a lot of their own management code (probably in
> shell scripts and such...) around that and if they get anything wrong
> there then people are going to end up with bad backups that they can't
> restore from, or they'll have corrupted clusters if they do manage to
> get them restored.

I think that this is another complaint that basically falls into the
category of saying that this proposal might not fix everything for
everybody, but that complaint could be levied against any reasonable
development proposal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2019-04-22 17:46:18 Re: [PATCH v1] Add \echo_stderr to psql
Previous Message Fabien COELHO 2019-04-22 17:42:38 Re: Trouble with FETCH_COUNT and combined queries in psql