Re: block-level incremental backup

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: block-level incremental backup
Date: 2019-04-11 17:29:29
Message-ID: 3e15314c-b8de-2c81-6722-80c33423bc85@postgrespro.ru
Lists: pgsql-hackers

09.04.2019 18:48, Robert Haas writes:
> Thoughts?
Hi,
Thank you for bringing that up.
In-core support of incremental backups is a long-awaited feature.
Hopefully, this take will end up committed in PG13.

Speaking of UI:
1) I agree that it should be implemented as a new replication command.

2) There should be a command to get only a map of changes without actual
data.

Most backup tools establish a server connection, so they can use this
protocol to get the list of changed blocks.
Then they can use this information for any purpose. For example,
distribute files between parallel workers to copy the data,
or estimate backup size before data is sent, or store metadata
separately from the data itself.
Most methods (except straightforward LSN comparison) consist of two
steps: get a map of changes and read blocks.
So a separate map command won't add much extra work.
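To illustrate how a client might use such a map before any data is sent, here is a minimal sketch. It assumes the map has already been decoded into a dict of filename to list of changed block numbers; that shape, the function names, and the 8192-byte block size (the default, but a build-time option) are all assumptions for illustration, not part of any proposal.

```python
import heapq

BLCKSZ = 8192  # default PostgreSQL block size; assumed here

def estimate_backup_size(change_map):
    """Bytes of page data to transfer for a {filename: [block numbers]} map."""
    return sum(len(blocks) for blocks in change_map.values()) * BLCKSZ

def distribute_files(change_map, nworkers):
    """Greedily assign files to workers, largest first, to balance block counts."""
    heap = [(0, w) for w in range(nworkers)]  # (blocks assigned so far, worker id)
    heapq.heapify(heap)
    plan = {w: [] for w in range(nworkers)}
    for name, blocks in sorted(change_map.items(), key=lambda kv: -len(kv[1])):
        load, worker = heapq.heappop(heap)  # least-loaded worker so far
        plan[worker].append(name)
        heapq.heappush(heap, (load + len(blocks), worker))
    return plan

# Hypothetical map of changed blocks per file.
change_map = {
    "base/1/100": [0, 1, 2, 3],
    "base/1/200": [5],
    "base/1/300": [1, 2],
}
size = estimate_backup_size(change_map)
plan = distribute_files(change_map, 2)
```

The greedy largest-first assignment is just one simple choice; a real tool may prefer to split large files between workers by block range.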

example commands:
GET_FILELIST [lsn]
returning JSON (or some other format) with filenames and maps of changed blocks

The map format is also open for discussion.
Currently, pg_probackup reuses the code from pg_rewind/datapagemap, but
I'm not sure this format is good for sending data via the protocol.
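As a rough illustration of one possible wire format, the sketch below packs changed block numbers into a bitmap with one bit per block (similar in spirit to a page bitmap, though the exact bit layout here is an assumption) and wraps it in JSON as a hex string. The function names and the JSON shape are hypothetical.

```python
import json

def blocks_to_bitmap(blocks):
    """Pack block numbers into a bitmap: bit (n % 8) of byte (n // 8) marks block n."""
    if not blocks:
        return b""
    bitmap = bytearray(max(blocks) // 8 + 1)
    for blkno in blocks:
        bitmap[blkno // 8] |= 1 << (blkno % 8)
    return bytes(bitmap)

def bitmap_to_blocks(bitmap):
    """Return the sorted block numbers whose bits are set."""
    return [byte_idx * 8 + bit
            for byte_idx, byte in enumerate(bitmap)
            for bit in range(8)
            if byte & (1 << bit)]

def encode_filelist(change_map):
    """Serialize {filename: [blocks]} as JSON with hex-encoded bitmaps."""
    return json.dumps({name: blocks_to_bitmap(blocks).hex()
                       for name, blocks in sorted(change_map.items())})

def decode_filelist(text):
    """Inverse of encode_filelist."""
    return {name: bitmap_to_blocks(bytes.fromhex(hexmap))
            for name, hexmap in json.loads(text).items()}

encoded = encode_filelist({"base/1/100": [0, 3, 9]})
decoded = decode_filelist(encoded)
```

A bitmap is compact when many blocks changed, but wasteful for a single changed block near the end of a large file, which is one reason the format deserves discussion.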

3) The API should provide functions to request data with a granularity
of file and block.
It will be useful for parallelism and for various future projects.

example commands:
GET_DATAFILE [filename [map of blocks] ]
GET_DATABLOCK [filename] [blkno]
returning data in some format
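On the server side, serving a block-granular request mostly comes down to seeking to blkno * block size. A minimal sketch, assuming the default 8192-byte block size and ignoring relation segmentation, locking, and checksums; the function names are hypothetical:

```python
import io

BLCKSZ = 8192  # default PostgreSQL block size; assumed, since it is a build-time option

def read_block(f, blkno):
    """Return block number blkno from an open binary file."""
    f.seek(blkno * BLCKSZ)
    data = f.read(BLCKSZ)
    if len(data) != BLCKSZ:
        raise ValueError(f"block {blkno} is missing or truncated")
    return data

def read_blocks(f, blknos):
    """Yield (blkno, data) for each requested block, in ascending order."""
    for blkno in sorted(set(blknos)):
        yield blkno, read_block(f, blkno)

# Usage with an in-memory stand-in for a three-block data file.
datafile = io.BytesIO(b"".join(bytes([i]) * BLCKSZ for i in range(3)))
selected = dict(read_blocks(datafile, [2, 0]))
```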

4) The algorithm for collecting changed blocks is another topic.
Its API, though, should be discussed here:

Do we want to have multiple implementations?
Personally, I think that it's good to provide several strategies,
since they have different requirements and fit different workloads.

Maybe we can add a hook to allow custom implementations.

Do we want to allow the backup client to specify which block collection
method to use?
example commands:
GET_FILELIST [lsn] [METHOD lsn | page | ptrack | etc]
Or should it be a server-side, cost-based decision?
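To make the two options concrete, here is a small sketch of parsing the command above. The grammar is only my reading of the example, not a spec: both arguments are optional, and an omitted METHOD is returned as None, meaning the server picks the method itself.

```python
KNOWN_METHODS = {"lsn", "page", "ptrack"}  # names taken from the example command

def parse_get_filelist(command):
    """Parse 'GET_FILELIST [lsn] [METHOD name]' into (lsn, method).

    Either element is None when omitted; a None method leaves the choice
    to the server.
    """
    tokens = command.split()
    if not tokens or tokens[0].upper() != "GET_FILELIST":
        raise ValueError("not a GET_FILELIST command")
    args = tokens[1:]
    lsn = None
    method = None
    if args and args[0].upper() != "METHOD":
        lsn = args.pop(0)
    if args:
        if len(args) != 2 or args[0].upper() != "METHOD":
            raise ValueError("expected: METHOD <name>")
        method = args[1].lower()
        if method not in KNOWN_METHODS:
            raise ValueError(f"unknown block collection method: {method}")
    return lsn, method
```

For example, `parse_get_filelist("GET_FILELIST 0/16B3748 METHOD ptrack")` yields the LSN string and `"ptrack"`, while leaving out the METHOD clause yields `None` for the method.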

5) The method based on LSN comparison stands out - it can be done in one
pass.
So it probably requires special protocol commands.
for example:
GET_DATAFILES [lsn]
GET_DATAFILE [filename] [lsn]

This is pretty simple to implement and pg_basebackup can use this method,
at least until we have something more advanced in-core.
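The one-pass nature of the LSN method can be sketched as follows. Each page starts with its LSN (pd_lsn, stored as two 32-bit halves in the server's native byte order), so a single scan of the file can pick out pages newer than the given LSN. The function names, the default 8192-byte block size, and the strict "greater than" comparison are assumptions; this sketch also ignores new all-zero pages, torn reads, and checksums, which a real implementation has to handle.

```python
import io
import struct

BLCKSZ = 8192  # default PostgreSQL block size; assumed here

def parse_lsn(text):
    """Convert the textual 'X/Y' hexadecimal LSN form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def page_lsn(page):
    """Read pd_lsn from the first 8 bytes of a page (native byte order)."""
    xlogid, xrecoff = struct.unpack_from("=II", page, 0)
    return (xlogid << 32) | xrecoff

def changed_blocks(f, start_lsn):
    """Scan a data file once, yielding (blkno, page) for pages newer than start_lsn."""
    blkno = 0
    while True:
        page = f.read(BLCKSZ)
        if len(page) < BLCKSZ:
            break  # end of file (a trailing partial page is skipped)
        if page_lsn(page) > start_lsn:
            yield blkno, page
        blkno += 1

# Usage with an in-memory stand-in: three pages with LSNs 0/100, 0/300 and 1/0.
pages = [struct.pack("=II", hi, lo).ljust(BLCKSZ, b"\x00")
         for hi, lo in [(0, 0x100), (0, 0x300), (1, 0)]]
changed = [blkno for blkno, _ in
           changed_blocks(io.BytesIO(b"".join(pages)), parse_lsn("0/200"))]
```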

I'll be happy to help with design, code, review, and testing.
Hope that my experience with pg_probackup will be useful.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
