Re: Hooks to track changed pages for backup purposes

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hooks to track changed pages for backup purposes
Date: 2017-08-31 14:44:41
Message-ID: CAB7nPqTVNKDTt+h+MZ+0B+vW8=B9dLhoygMFjujRNho2OYU14g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 31, 2017 at 3:02 PM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> Here is the patch with hooks that I consider sufficient for implementation of incremental backup with pages tracking as extension.
>
> Recently I was posting these things to the thread "Adding hook in BufferSync for backup purposes" [0], but here I start separate thread since Subj field of previous discussion is technically wrong.
>
> Currently various incremental backups can use one of this methods to take diff of a cluster since some LSN:
> 1. Check LSN of every page
> 2. Scan WAL and collect block numbers of changed pages
>
> I propose adding hooks:
> 1. When a buffer is registered in WAL insertion
> This hook is supposed to place blocknumbers in a temporary storage, like backend-local static array.
> 2. When a WAL record insertion is started and finished, to transfer blocknumbers to more concurrency-protected storage.
> 3. When the WAL segment is switched to initiate async transfer of accumulated blocknumbers to durable storage.
>
> When we have accumulated diff blocknumbers for most of segments we can significantly speed up method of WAL scanning. If we have blocknumbers for all segments we can skip WAL scanning at all.
>
> I think that these proposed hooks can enable more efficient backups. How do you think?
>
> Any ideas will be appreciated. This patch is influenced by the code of PTRACK (Yury Zhuravlev and Postgres Professional).

+ if (xlog_insert_buffer_hook)
+ for(nblock = 0; nblock <
xlogreader->max_block_id; nblock++)
+ {
+ if(xlogreader->blocks[nblock].forknum == MAIN_FORKNUM)
+ {
+
xlog_insert_buffer_hook(xlogreader->blocks[nblock].blkno,
+
xlogreader->blocks[nblock].rnode, true);
+ }
+ }
[...]
+ if(xlog_begin_insert_hook)
+ xlog_begin_insert_hook();
Such things are not Postgres-C like. The first one should use
brackets, and the second one proper spacing after "if".

I don't understand what xlog_begin_insert_hook() is good for. There
are no arguments fed to this hook, so modules would not be able to
analyze things in this context, except shared memory and process
state?

Those hooks are put in hot code paths, and could impact performance of
WAL insertion itself. So you basically move the cost of scanning WAL
segments for those blocks from any backup solution to the WAL
insertion itself. Really, wouldn't it be more simple to let for
example the archiver process to create this meta-data if you just want
to take faster backups with a set of segments? Even better, you could
do a scan after archiving N segments, and then use M jobs to do this
work more quickly. (A set of background workers could do this job as
well).

In the backup/restore world, backups can be allowed to be taken at a
slow pace, what matters is to be able to restore them quickly. In
short, anything moving performance from an external backup code path
to a critical backend code path looks like a bad design to begin with.
So I am dubious that what you are proposing here is a good idea.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-08-31 15:09:37 Assorted leaks and weirdness in parallel execution
Previous Message Michael Paquier 2017-08-31 14:09:20 Re: multiple target of VACUUM command