Re: WIP/PoC for parallel backup

From: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2019-10-07 13:35:19
Message-ID: CADM=Jegn-Tbg55b5iHyt2aXG_j_qnb0piz0tvCRAcrRC6ejcmA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 7, 2019 at 6:05 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Oct 7, 2019 at 8:48 AM Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> wrote:
> > Sure. However, the backup manifest patch calculates the checksum of
> > each backup file and includes it in the manifest, and this is done
> > while the file is being transferred to the frontend. The manifest
> > file itself is copied at the very end of the backup. In parallel
> > backup, I need the list of filenames before file contents are
> > transferred, in order to divide them among multiple workers. For
> > that, the manifest file has to be available when START_BACKUP is
> > called.
> >
> > That means the backup manifest should support being created without
> > the checksums during START_BACKUP().
> > I also need the directory information, for two reasons:
> >
> > - In plain format, the base path has to exist before we can write the
> >   file. We can extract the base path from the file's path, but doing
> >   that for all files does not seem like a good idea.
> > - Base backup does not include the contents of some directories, but
> >   those directories, although empty, are still expected in PGDATA.
> >
> > I can make these changes part of parallel backup (which would be on
> > top of the backup manifest patch), or these changes can be done as
> > part of the manifest patch and then parallel backup can use them.
> >
> > Robert, what do you suggest?
>
> I think we should probably not use backup manifests here, actually. I
> initially thought that would be a good idea, but after further thought
> it seems like it just complicates the code to no real benefit.

Okay.

> I
> suggest that the START_BACKUP command just return a result set, like a
> query, with perhaps four columns: file name, file type ('d' for
> directory or 'f' for file), file size, file mtime. pg_basebackup will
> ignore the mtime, but some other tools might find that useful
> information.
>
Yes, the current patch already returns a result set. I will add the
additional information.
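
As a rough illustration of how a client might consume such a result set,
here is a minimal sketch in Python. It is not code from the patch: the
FileEntry type, the helper names, and the sample rows are all made up for
illustration, and only the four columns (name, type, size, mtime) come from
the proposal above. It also shows why the directory rows matter in plain
format: directories are created first, including ones that stay empty.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class FileEntry:
    """One row of the proposed file-list result set (hypothetical type)."""
    name: str    # path relative to the data directory
    kind: str    # 'd' for directory, 'f' for file
    size: int    # size in bytes
    mtime: int   # modification time; pg_basebackup would ignore it


def split_entries(rows):
    """Separate directory rows from file rows, preserving their order."""
    entries = [FileEntry(*row) for row in rows]
    dirs = [e for e in entries if e.kind == "d"]
    files = [e for e in entries if e.kind == "f"]
    return dirs, files


def create_directories(target, dirs):
    """Create every listed directory, including ones that stay empty."""
    for entry in dirs:
        (Path(target) / entry.name).mkdir(parents=True, exist_ok=True)


# Made-up rows, for illustration only.
ROWS = [
    ("base", "d", 0, 1570455319),
    ("base/1", "d", 0, 1570455319),
    ("pg_wal", "d", 0, 1570455319),
    ("base/1/1259", "f", 8192, 1570455319),
]

if __name__ == "__main__":
    dirs, files = split_entries(ROWS)
    with tempfile.TemporaryDirectory() as target:
        create_directories(target, dirs)
        print(sorted(p.name for p in Path(target).iterdir()))
```

With the directories in place up front, a worker writing a file never has
to derive and create the base path itself.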

> I wonder if we should also split START_BACKUP (which should enter
> non-exclusive backup mode) from GET_FILE_LIST, in case some other
> client program wants to use one of those but not the other. I think
> that's probably a good idea, but not sure.
>

Currently, pg_basebackup does not enter exclusive backup mode, and other
tools have to use the pg_start_backup() and pg_stop_backup() functions to
achieve that. Since we are breaking the backup into multiple commands, I
believe it would be a good idea to have this option. I will include it in
the next revision of this patch.

>
> I still think that the files should be requested one at a time, not a
> huge long list in a single command.
>
Sure, I will make the change.
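
To make the one-file-per-request idea concrete, here is a minimal sketch,
again in Python and not taken from the patch. The run_workers function and
its fetch_file callback are hypothetical; fetch_file stands in for whatever
command ends up transferring a single file. Each worker pulls the next file
name from a shared queue and issues one request for it, so the work is
divided dynamically rather than by sending each worker a long list.

```python
import queue
import threading


def run_workers(file_names, fetch_file, num_workers=4):
    """Have each worker request one file at a time from a shared queue."""
    pending = queue.Queue()
    for name in file_names:
        pending.put(name)

    fetched = []              # (worker_id, file_name) pairs, in completion order
    lock = threading.Lock()   # protects the fetched list

    def worker(worker_id):
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return        # nothing left to request
            fetch_file(name)  # one request per file (hypothetical callback)
            with lock:
                fetched.append((worker_id, name))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched


if __name__ == "__main__":
    names = [f"base/1/{n}" for n in range(10)]
    result = run_workers(names, fetch_file=lambda name: None, num_workers=3)
    print(len(result))
```

A shared queue also balances uneven file sizes naturally, since a worker
stuck on a large file simply takes fewer entries.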

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
