Re: Implementing incremental backup

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: klaussfreire(at)gmail(dot)com
Cc: sfrost(at)snowman(dot)net, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Implementing incremental backup
Date: 2013-06-20 01:25:21
Message-ID: 20130620.102521.1131455910749185193.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>>> On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>>>> * Claudio Freire (klaussfreire(at)gmail(dot)com) wrote:
>>>>> I don't see how this is better than snapshotting at the filesystem
>>>>> level. I have no experience with TB scale databases (I've been limited
>>>>> to only hundreds of GB), but from my limited mid-size db experience,
>>>>> filesystem snapshotting is pretty much the same thing you propose
>>>>> there (xfs_freeze), and it works pretty well. There's even automated
>>>>> tools to do that, like bacula, and they can handle incremental
>>>>> snapshots.
>>>>
>>>> Large databases tend to have multiple filesystems and getting a single,
>>>> consistent, snapshot across all of them while under load is..
>>>> 'challenging'. It's fine if you use pg_start/stop_backup() and you're
>>>> saving the XLOGs off, but if you can't do that..
>>>
>>> Good point there.
>>>
>>> I still don't like the idea of having to mark each modified page. The
>>> WAL compressor idea sounds a lot more workable. As in scalable.
>>
>> Why do you think the WAL compressor idea is more scalable? I really
>> want to know why. Besides the unlogged tables issue, I can accept the
>> idea if a WAL-based solution is much more efficient. If there's no
>> perfect, ideal solution, we need to prioritize. My #1 priority is
>> being able to create an incremental backup of a TB-scale database,
>> with a backup file that is small enough and a creation time that is
>> acceptable. I just don't know why scanning the WAL stream is much
>> cheaper than recording modified page information.
>
> Because it aggregates updates.
>
> When you work at the storage manager level, you only see block-sized
> operations. This results in the need to WAL-log bit-sized updates to
> some hypothetical dirty-map index. Even when done 100% efficiently,
> this implies at least one write per dirtied block, which could as much
> as double write I/O in the worst (and entirely expected) case.
>
> When you do it at WAL segment recycle time, or better yet during
> checkpoints, you deal with checkpoint-scale operations. You can
> aggregate dirty-map updates, if you keep a dirty-map, which could not
> only reduce I/O considerably (by a much increased likelihood of write
> coalescence), but also let you schedule it better (toss it in the
> background, with checkpoints). This is for gathering dirty-map
> updates, which still leaves you with the complex problem of then
> actually snapshotting those pages consistently without interfering
> with ongoing transactions.
>
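To check my understanding of the aggregation point, the dirty-map
coalescing would look roughly like this (a toy sketch in Python with
made-up names, not actual backend code):

```python
# Hypothetical sketch: coalesce per-block dirty notifications in memory
# and persist the map once per checkpoint, so N writes to the same block
# produce one flush entry, not N separate I/Os.
class DirtyMap:
    def __init__(self):
        self.dirty = set()               # (relfilenode, blockno) pairs

    def mark(self, rel, blockno):
        self.dirty.add((rel, blockno))   # idempotent: repeats coalesce

    def flush_at_checkpoint(self):
        batch = sorted(self.dirty)       # one batched write, not per-block
        self.dirty.clear()
        return batch

dm = DirtyMap()
for _ in range(1000):
    dm.mark(16384, 7)                    # same block dirtied 1000 times
dm.mark(16384, 8)
print(len(dm.flush_at_checkpoint()))     # -> 2 entries, not 1001
```

which still says nothing about how to snapshot those pages consistently,
as you point out.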
> If you do a WAL compressor, WAL entries are write-once, so you'll have
> no trouble snapshotting those pages. You have the checkpoint's initial
> full page write, so you don't even have to read the page, and you can
> accumulate all further partial writes into one full page write, and
> dump that into an "incremental archive". So, you get all the I/O
> aggregation from above, which reduces I/O to the point where it only
> doubles WAL I/O. It's bound by a constant, and in contrast to
> dirty-map updates, it's sequential I/O so it's a lot faster. It's thus
> perfectly scalable.
>
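So the compressor would fold the checkpoint's full-page image and all
later partial writes for the same page into a single page image,
something like this sketch (hypothetical, in Python; real WAL records
are of course far more complex):

```python
# Hypothetical sketch: replay simplified WAL records for a page in order,
# starting from the full-page write, and keep only the final page image
# per block for the incremental archive.
PAGE_SIZE = 8192

def compress_wal(records):
    """records: iterable of (blockno, offset, data) tuples, where a
    full-page image is (blockno, 0, PAGE_SIZE bytes).
    Returns {blockno: final page image}."""
    pages = {}
    for blockno, offset, data in records:
        page = bytearray(pages.get(blockno, bytes(PAGE_SIZE)))
        page[offset:offset + len(data)] = data   # apply in WAL order
        pages[blockno] = bytes(page)
    return pages                                 # one full page per block

wal = [
    (7, 0, bytes(PAGE_SIZE)),    # full-page write after checkpoint
    (7, 100, b"update-1"),       # later partial writes accumulate on top
    (7, 100, b"update-2"),
]
archive = compress_wal(wal)
print(len(archive), archive[7][100:108])   # one page; latest data wins
```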
> Not only that, but you're also amortizing incremental backup costs
> over time, since you're making them continuously rather than at
> regular intervals. You'll have one incremental backup per checkpoint. If you
> want to coalesce backups, you launch another compressor to merge the
> last incremental checkpoint with the new one. And, now this is the
> cherry on top, you only have to do this on the archived WALs, which
> means you could very well do it on another system, freeing your main
> cluster from all this I/O. It's thus perfectly scalable.
>
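And if I follow, coalescing two incremental backups is just letting the
newer archive's page images win (again a toy sketch, runnable anywhere
since it only needs the archived files, not the cluster):

```python
# Hypothetical sketch: merge two incremental archives (block -> page
# image maps); pages present in both are taken from the newer archive.
def merge_incrementals(older, newer):
    merged = dict(older)
    merged.update(newer)     # newer full-page images override older ones
    return merged

old = {1: b"page1-v1", 2: b"page2-v1"}
new = {2: b"page2-v2", 3: b"page3-v1"}
print(sorted(merge_incrementals(old, new).items()))
```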
> The only bottleneck here is WAL archiving. This assumes you can
> afford WAL archiving at least to a local filesystem, and that the WAL
> compressor is able to cope with WAL bandwidth. But I have no reason to
> think you'd be able to cope with dirty-map updates anyway if you were
> saturating the WAL compressor, as the compressor is more efficient on
> amortized cost per transaction than the dirty-map approach.

Thank you for the detailed explanation. I will think more about this.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
