Re: block-level incremental backup

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: block-level incremental backup
Date: 2019-04-17 21:20:03
Message-ID: 20190417212003.GG6197@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > > > I love the general idea of having additional facilities in core to
> > > > support block-level incremental backups. I've long been unhappy that
> > > > any such approach ends up being limited to a subset of the files which
> > > > need to be included in the backup, meaning the rest of the files have to
> > > > be backed up in their entirety. I don't think we have to solve for that
> > > > as part of this, but I'd like to see a discussion for how to deal with
> > > > the other files which are being backed up to avoid needing to just
> > > > wholesale copy them.
> > >
> > > I assume you are talking about non-heap/index files. Which of those are
> > > large enough to benefit from incremental backup?
> >
> > Based on discussions I had with Andrey, the visibility map specifically
> > is an issue for them with WAL-G. I haven't spent a lot of time thinking
> > about it, but I can understand how that could be an issue.
>
> If I understand correctly, the VM contains 1 byte per 4 heap pages and
> the FSM contains 1 byte per heap page (plus some overhead for higher
> levels of the tree). Since the FSM is not WAL-logged, I'm not sure
> there's a whole lot we can do to avoid having to back it up, although
> maybe there's some clever idea I'm not quite seeing. The VM is
> WAL-logged, albeit with some strange warts that I have the honor of
> inventing, so there's more possibilities there.
>
> Before worrying about it too much, it would be useful to hear more
> about the concerns related to these forks, so that we make sure we're
> solving the right problem. It seems difficult for a single relation
> to be big enough for these to be much of an issue. For example, on a
> 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
> = 32MB. Not nothing, but 32MB of useless overhead every time you back
> up a 1TB database probably isn't going to break the bank. It might be
> more of a concern for users with many small tables. For example, if
> somebody has got a million tables with 1 page in each one, they'll
> have a million data pages, a million VM pages, and 3 million FSM pages
> (unless the new don't-create-the-FSM-for-small-tables stuff in v12
> kicks in). I don't know if it's worth going to a lot of trouble to
> optimize that case. Creating a million tables with 100 tuples (or
> whatever) in each one sounds like terrible database design to me.

As I understand it, the problem is not with backing up an individual
database or cluster, but rather with backing up thousands of individual
clusters with thousands of tables in each, leading to an awful lot of
FSM and VM forks, all of which end up being copied and stored
wholesale. I'll point this thread out to Andrey and hopefully he'll
have a chance to share more specific information.

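To put rough numbers on that per-table overhead, here's a
back-of-the-envelope sketch (throwaway C, nothing from the tree) that
just applies the ratios quoted above (2 VM bits and 1 FSM byte per 8kB
heap page), with each non-empty fork rounded up to whole 8kB pages,
which is where the pain comes from for very small tables:

/*
 * Back-of-the-envelope sketch only: rough size of the VM and FSM forks
 * for a heap of a given size, using 2 VM bits and 1 FSM byte per 8kB
 * heap page.  The FSM's upper tree levels are ignored, which is why a
 * one-page table really ends up with ~3 FSM pages rather than the 1
 * shown here.
 */
#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192

static void
fork_sizes(uint64_t heap_bytes)
{
    uint64_t heap_pages = (heap_bytes + BLCKSZ - 1) / BLCKSZ;
    uint64_t vm_bytes = (heap_pages * 2 + 7) / 8;   /* 2 bits per page */
    uint64_t fsm_bytes = heap_pages;                /* 1 byte per page */
    uint64_t vm_pages = (vm_bytes + BLCKSZ - 1) / BLCKSZ;
    uint64_t fsm_pages = (fsm_bytes + BLCKSZ - 1) / BLCKSZ;

    printf("heap %llu bytes: VM ~%llu kB (%llu pages), FSM ~%llu kB (%llu pages)\n",
           (unsigned long long) heap_bytes,
           (unsigned long long) (vm_pages * BLCKSZ / 1024),
           (unsigned long long) vm_pages,
           (unsigned long long) (fsm_pages * BLCKSZ / 1024),
           (unsigned long long) fsm_pages);
}

int
main(void)
{
    fork_sizes((uint64_t) 1 << 40);     /* the 1TB relation case */
    fork_sizes(BLCKSZ);                 /* a one-page table */
    return 0;
}

For the 1TB case that works out to roughly 32MB of VM and 128MB of FSM;
for the million-tiny-tables case it's the per-fork page rounding (plus
the FSM's extra levels) that multiplies out badly, as noted above.
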
> > > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > > (which, as you note later, is missing an awful lot of what users expect
> > > > from a real backup solution already- retention handling, parallel
> > > > capabilities, WAL archive management, and many more... but also is just
> > > > not nearly as developed a tool as the external solutions) is going to
> > > > make things unnecessarily difficult when what we really want here is
> > > > better support from core for block-level incremental backup for the
> > > > existing external tools to leverage.
> > >
> > > I think there is some interesting complexity brought up in this thread.
> > > Which options are going to minimize storage I/O and network I/O, have
> > > only background overhead, allow parallel operation, and integrate with
> > > pg_basebackup? Eventually we will need to evaluate the incremental
> > > backup options against these criteria.
> >
> > This presumes that we're going to have multiple competing incremental
> > backup options presented, doesn't it? Are you aware of another effort
> > going on which aims for inclusion in core? There's been past attempts
> > made, but I don't believe there's anyone else currently planning to or
> > working on something for inclusion in core.
>
> Yeah, I really hope we don't end up with dueling patches. I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction. On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions. For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration. Like a sequential scan, it always
> works, and that is an advantage. Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone. I think
> there's room for two or three approaches there -- not in the sense of
> competing patches, but in the sense of giving users a choice based on
> their needs.

I can agree with the idea of having multiple options for how to collect
up the set of changed blocks, though I continue to feel that a
WAL-scanning approach isn't something we'd implement in the backend at
all, since it doesn't require the backend and a given backend might not
even have all of the WAL that is relevant. I certainly don't think it
makes sense to have a backend go fetch WAL from the archive and then
merge it to provide the result to a client asking for it- that's adding
entirely unnecessary load to the database server.

As such, the LSN-based scan of relation files to produce the set of
changed blocks is the only approach that seems to me to make sense to
implement in the backend.

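Just to make concrete what that LSN-based scan amounts to, here's a
rough standalone sketch (not proposed code- the real thing would live
in the backend and deal with concurrent writes, torn pages, checksums,
segment boundaries and so on); it simply relies on the first eight
bytes of every page header being the page's LSN, and takes the
threshold as a plain 64-bit number rather than the usual X/Y notation:

/*
 * Rough sketch only: scan one relation segment file and report blocks
 * whose page LSN is newer than a given threshold LSN.  Assumes the
 * standard 8kB block size and that the file was written on this same
 * architecture (the LSN halves are read in native byte order).
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

static uint64_t
page_lsn(const unsigned char *page)
{
    uint32_t hi;
    uint32_t lo;

    memcpy(&hi, page, 4);       /* pd_lsn.xlogid */
    memcpy(&lo, page + 4, 4);   /* pd_lsn.xrecoff */
    return ((uint64_t) hi << 32) | lo;
}

int
main(int argc, char **argv)
{
    unsigned char page[BLCKSZ];
    uint64_t threshold;
    uint64_t blkno = 0;
    FILE *fp;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s SEGMENT-FILE THRESHOLD-LSN\n", argv[0]);
        return 1;
    }

    fp = fopen(argv[1], "rb");
    if (fp == NULL)
    {
        perror(argv[1]);
        return 1;
    }
    threshold = strtoull(argv[2], NULL, 0);

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        /* never-initialized (all-zero) pages have LSN 0 and are skipped */
        if (page_lsn(page) > threshold)
            printf("block %llu changed since threshold\n",
                   (unsigned long long) blkno);
        blkno++;
    }
    fclose(fp);
    return 0;
}

The interesting part is just the PageGetLSN()-style comparison;
everything else is plumbing the backend already has.
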
Just to be clear- I don't have any problem with a tool being
implemented in core to support the scanning of WAL to produce a
changeset; I just don't think that's something we'd have built into the
*backend*, nor do I think it would make sense to add that functionality
to the replication (or any other) protocol, at least not with support
for arbitrary LSN starting and ending points.

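To make the "changeset" idea a bit more concrete, here's a throwaway
sketch that leans on pg_waldump's existing text output rather than any
new facility; a real tool would use the xlogreader machinery directly
and produce something more structured, and the "blkref" text parsed
here is simply what pg_waldump happens to print today:

/*
 * Throwaway sketch: read pg_waldump output on stdin and print the
 * (tablespace/database/relfilenode, fork, block) references it
 * mentions, i.e. a crude changeset for the LSN range pg_waldump was
 * run over.  Deduplication and sorting are left to sort -u, and the
 * parsed text format could change in any release.
 */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char line[65536];

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        const char *p = line;

        while ((p = strstr(p, "blkref #")) != NULL)
        {
            unsigned int id, spc, db, rel, blk;
            char fork[16];

            /* non-main forks are printed as "... fork <name> blk <n>" */
            if (sscanf(p, "blkref #%u: rel %u/%u/%u fork %15s blk %u",
                       &id, &spc, &db, &rel, fork, &blk) == 6)
                printf("%u/%u/%u\t%s\t%u\n", spc, db, rel, fork, blk);
            else if (sscanf(p, "blkref #%u: rel %u/%u/%u blk %u",
                            &id, &spc, &db, &rel, &blk) == 5)
                printf("%u/%u/%u\tmain\t%u\n", spc, db, rel, blk);
            p += strlen("blkref #");
        }
    }
    return 0;
}

Run as something like

  pg_waldump -p $PGDATA/pg_wal -s <start-lsn> -e <end-lsn> | ./blkrefs | sort -u

(where blkrefs is the sketch above) to get the distinct blocks touched
between the two LSNs.
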
A thought that occurs to me is to have the functions supporting the WAL
merging live in libcommon, available both to the independent executable
for doing WAL merging and to the backend, which could then do WAL
merging itself- but for a specific purpose: having a way to reduce the
amount of WAL that needs to be sent to a replica which has a
replication slot but has been disconnected for a while. Of course,
there'd have to be some way to handle the other files for that to work
when bringing a long out-of-date replica back up to date. Now, if we
taught the backup tool about having a replication slot, then perhaps we
could have the backend effectively provide the same capability proposed
above, but without the need to go get the WAL from the archive
repository.

I'm still not entirely sure that this makes sense to do in the backend
due to the additional load; this is really just some brainstorming.

Thanks!

Stephen
