Re: block-level incremental backup

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: block-level incremental backup
Date: 2019-04-10 16:21:03
Message-ID: CA+TgmobFVe4J4AA7z9OMUzKnm09Tt+sybhxeL_Ddst3q3wqpzQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 10, 2019 at 10:57 AM Jehan-Guillaume de Rorthais
<jgdr(at)dalibo(dot)com> wrote:
> My idea would be create a new tool working on archived WAL. No burden
> server side. Basic concept is:
>
> * parse archives
> * record latest relevant FPW for the incr backup
> * write new WALs with recorded FPW and removing/rewriting duplicated walrecords.
>
> It's just a PoC and I hadn't finished the WAL writing part...not even talking
> about the replay part. I'm not even sure this project is a good idea, but it is
> a good educational exercice to me in the meantime.
>
> Anyway, using real life OLTP production archives, my stats were:
>
> # WAL xlogrec kept Size WAL kept
> 127 39% 50%
> 383 22% 38%
> 639 20% 29%
>
> Based on this stats, I expect this would save a lot of time during recovery in
> a first step. If it get mature, it might even save a lot of archives space or
> extend the retention period with degraded granularity. It would even help
> taking full backups with a lower frequency.
>
> Any thoughts about this design would be much appreciated. I suppose this should
> be offlist or in a new thread to avoid polluting this thread as this is a
> slightly different subject.

Interesting idea, but I don't see how it can work if you only deal
with the FPWs and not the other records. For instance, suppose that
you take a full backup at time T0, and then at time T1 there are two
modifications to a certain block in quick succession. That block is
then never touched again. Since no checkpoint intervenes between the
modifications, the first one emits an FPI and the second does not.
Capturing the FPI is fine as far as it goes, but unless you also do
something with the non-FPI change, you lose that second modification.
You could fix that by having your tool replicate the effects of WAL
apply outside the server, but that sounds like a ton of work and a ton
of possible bugs.

I have a related idea, though. Suppose that, as Peter says upthread,
you have a replication slot that prevents old WAL from being removed.
You also have a background worker that is connected to that slot. It
decodes WAL and produces summary files containing all block-references
extracted from those WAL records and the associated LSN (or maybe some
approximation of the LSN instead of the exact value, to allow for
compression and combining of nearby references). Then you hold onto
those summary files after the actual WAL is removed. Now, when
somebody asks the server for all blocks changed since a certain LSN,
it can use those summary files to figure out which blocks to send
without having to read all the pages in the database. Although I
believe that a simple system that finds modified blocks by reading
them all is good enough for a first version of this feature and useful
in its own right, a more efficient system will be a lot more useful,
and something like this seems to me to be probably the best way to
implement it.

The reason why I think this is likely to be superior to other possible
approaches, such as the ptrack approach Konstantin suggests elsewhere
on this thread, is because it pushes the work of figuring out which
blocks have been modified into the background. With a ptrack-type
approach, the server has to do some non-zero amount of extra work in
the foreground every time it modifies a block. With an approach based
on WAL-scanning, the work is done in the background and nobody has to
wait for it. It's possible that there are other considerations which
aren't occurring to me right now, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Verite 2019-04-10 16:32:35 Re: Cleanup/remove/update references to OID column
Previous Message David Rowley 2019-04-10 16:14:11 Re: Should the docs have a warning about pg_stat_reset()?