Re: Streaming a base backup from master

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: Streaming a base backup from master
Date: 2010-09-06 14:07:59
Message-ID: AANLkTimVrLsH=ox4=WnxwYAsy4LSKYRjAKkmRW=nFOJ8@mail.gmail.com
Lists: pgsql-hackers

On Sun, Sep 5, 2010 at 4:51 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org> wrote:

> If you're working from a known good version of the database at some
> point, yes, you are right, you have more interesting options. If you
> don't, you want something that will fix it.

Sure, in that case you want to restore from backup, and whatever
tool you use to do that gives the same net result. I'm not sure
rsync is actually going to be much faster, though, since it still
has to read all of the existing database, which a normal restore
doesn't have to do. If the database has changed significantly that's
a lot of extra I/O, and you're probably on a local network with
plenty of bandwidth available.

What I'm talking about is how you *take* backups. Currently you have
to take a full backup, which can be a big job if you have a large
data warehouse. If only a small percentage of the database is
changing, you can use rsync to reduce the network bandwidth needed
to transfer your backup, but you still have to read the entire
database and write out the entire backup.

Incremental backups mean being able to read just the data blocks
that have been modified and write out a backup file containing just
those blocks. When it comes time to restore, you restore the last
full backup, then apply any incremental backups taken since then,
then replay whatever WAL is needed to bring it to a consistent
state.
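
To make the sequence concrete, here is a minimal Python sketch of
that restore procedure. It assumes a purely hypothetical incremental
format of (relation file path, block number, page image) tuples and
an invented base_backup.extract_to() helper; it is not any actual
PostgreSQL on-disk format or API.

    import os

    BLOCK_SIZE = 8192  # PostgreSQL's default page size

    def apply_incremental(datadir, incremental):
        # Overlay each changed block from one incremental backup
        # onto the files restored from the base backup.
        for relpath, block_no, page in incremental:
            with open(os.path.join(datadir, relpath), "r+b") as f:
                f.seek(block_no * BLOCK_SIZE)
                f.write(page)

    def restore(datadir, base_backup, incrementals):
        base_backup.extract_to(datadir)  # 1. last full backup
        for inc in incrementals:         # 2. incrementals, oldest first
            apply_incremental(datadir, inc)
        # 3. WAL replay to reach a consistent state happens as usual
        #    when the server starts in recovery.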

I think that description pretty much settles the question in my
mind. The implementation choice of scanning the WAL to find all the
changed blocks is what makes incremental backups pay off in the use
cases where they matter. If you still have to read the entire
database, there's not much to be gained beyond storage space. If you
scan the WAL, you can avoid reading most of your large data
warehouse when generating the incremental and read only the busy
portion.
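
Here is a sketch of that scanning step, again in Python. The
wal_records() reader, record.lsn, and record.touched_blocks are
invented stand-ins rather than a real WAL decoder, but they show
where the win comes from: the set of touched blocks collapses
duplicates, and only those blocks are ever read from disk.

    import os

    BLOCK_SIZE = 8192

    def changed_blocks(wal_segments, since_lsn):
        # One pass over the WAL collects the distinct blocks
        # modified since the last backup; a block rewritten many
        # times still appears only once in the set.
        changed = set()
        for record in wal_records(wal_segments):  # hypothetical reader
            if record.lsn >= since_lsn:
                changed.update(record.touched_blocks)  # (relpath, block_no)
        return changed

    def take_incremental(datadir, wal_segments, since_lsn):
        # Read only the blocks the WAL says were modified; the rest
        # of the (possibly huge) database is never touched.
        backup = []
        for relpath, block_no in sorted(
                changed_blocks(wal_segments, since_lsn)):
            with open(os.path.join(datadir, relpath), "rb") as f:
                f.seek(block_no * BLOCK_SIZE)
                backup.append((relpath, block_no, f.read(BLOCK_SIZE)))
        return backup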

In the use case where the database is extremely busy but keeps
writing and rewriting the same small set of blocks over and over,
even scanning the WAL might not be ideal. For that use case it might
be more useful to generate, at every checkpoint, a kind of
wal-summary listing all the blocks touched since the last
checkpoint. But that could be a later optimization.
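
To illustrate, a sketch of what such summaries might look like,
using a hypothetical text format of one path/block pair per line.
The point is that the union of the per-checkpoint summaries stays
small for a hot-but-narrow working set, no matter how much WAL the
workload generates.

    def write_checkpoint_summary(touched_blocks, summary_path):
        # touched_blocks: set of (relpath, block_no) accumulated in
        # memory since the previous checkpoint.
        with open(summary_path, "w") as f:
            for relpath, block_no in sorted(touched_blocks):
                f.write(f"{relpath}\t{block_no}\n")

    def blocks_since(summary_paths):
        # Union of the summaries covering the window since the last
        # backup; this is what an incremental backup would read
        # instead of rescanning the WAL itself.
        blocks = set()
        for path in summary_paths:
            with open(path) as f:
                for line in f:
                    relpath, block_no = line.rstrip("\n").rsplit("\t", 1)
                    blocks.add((relpath, int(block_no)))
        return blocks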

--
greg
