Re: design for parallel backup

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-04-20 12:49:50
Message-ID: CAA4eK1KHxnxwrgCccjC9Coa9QG4a_-FxLZr_cjBCy018q3gRAg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 15, 2020 at 9:27 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.
>
> Specifically, as Andres recently pointed out[1], almost anything that
> you might want to do on the client side, you might also want to do on
> the server side. We already have an option to let the client compress
> each tarball, but you might also want the server to, say, compress
> each tarball[2]. Similarly, you might want either the client or the
> server to be able to encrypt each tarball, or compress it with a
> different compression algorithm than gzip. If, as is presently the
> case, the server is always returning a set of tarballs, it's pretty
> easy to see how to make this work in the same way on either the client
> or the server, but if the server returns a set of tarballs in
> non-parallel backup cases, and a bunch of individual files in parallel
> backup cases, it's a lot harder to see how any sort of server-side
> processing should work, or how the same mechanism could be used on
> either the client side or the server side.
>
> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.
>

It is not apparent how you are envisioning this division on the
server side. I think in the currently proposed patch, each worker on
the client side requests specific files. So, how are workers going to
request such numbered tar files, and how will we ensure that the
division of work among workers is fair?
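
Just to make the question concrete, the simplest server-side division
I can imagine (purely a sketch of the idea, not something from either
patch; the file names and sizes below are invented) is to hand each
file to whichever worker currently has the fewest bytes assigned:

# Sketch only: greedy, size-balanced division of data files among N
# per-worker tarballs.  In reality the list would come from the server
# walking the data directory; these entries are made up.
def divide(files, nworkers):
    buckets = {n: [] for n in range(1, nworkers + 1)}
    totals = {n: 0 for n in range(1, nworkers + 1)}
    # Largest files first, each one to the currently emptiest worker.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        w = min(totals, key=totals.get)
        buckets[w].append(path)
        totals[w] += size
    return buckets

files = [("base/13593/16396", 1 << 30),
         ("base/13593/16399", 512 << 20),
         ("base/13593/1259", 65536),
         ("base/13593/2619", 8192)]
for w, paths in divide(files, 4).items():
    print("base-%d.tar" % w, "<-", paths)

Even something that simple gives a reasonably even split by size, but
it still does not tell us how a worker connection on the client side
learns which numbered tar file it is supposed to fetch.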

> This is more or less the naming convention that
> BART uses for its parallel backup implementation, which, incidentally,
> I did not write. I don't really care if we pick something else, but it
> seems like a sensible choice. The reason why I say "some or all" is
> that some workers might not get any of the data for a given
> tablespace. In fact, it's probably desirable to have different workers
> work on different tablespaces as far as possible, to maximize parallel
> I/O, but it's quite likely that you will have more workers than
> tablespaces. So you might end up, with pg_basebackup -j4, having the
> server send you base-1.tar and base-2.tar and base-4.tar, but not
> base-3.tar, because worker 3 spent all of its time on user-defined
> tablespaces, or was just out to lunch.
>
> Now, if you use -Fp, those tar files are just going to get extracted
> anyway by pg_basebackup itself, so you won't even know they exist.
> However, if you use -Ft, you're going to end up with more files than
> before. This seems like something of a wart, because you wouldn't
> necessarily expect that the set of output files produced by a backup
> would depend on the degree of parallelism used to take it. However,
> I'm not sure I see a reasonable alternative. The client could try to
> glue all of the related tar files sent by the server together into one
> big tarfile, but that seems like it would slow down the process of
> writing the backup by forcing the different server connections to
> compete for the right to write to the same file.
>
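
One thought on the write-contention point: if the client glued the tar
files together only after all the connections had finished writing
their own files, they would not need to compete for the same file.
Plain (uncompressed) tar archives can be combined by walking the
512-byte headers and dropping the end-of-archive marker of every part
except the last one; a rough sketch (assuming plain ustar/pax input,
not code from any patch, and the file names are invented):

# Sketch only: concatenate plain tar files by copying each one up to
# its end-of-archive marker (the first all-zero 512-byte header block)
# and keeping the marker of the last part only.
import shutil

BLOCK = 512

def members_end(path):
    """Offset of the end-of-archive marker in an uncompressed tar file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(BLOCK)
            if len(header) < BLOCK or header == b"\0" * BLOCK:
                return offset
            size = int(header[124:136].split(b"\0")[0] or b"0", 8)
            payload = (size + BLOCK - 1) // BLOCK * BLOCK
            f.seek(payload, 1)
            offset += BLOCK + payload

def concat_tars(parts, outname):
    with open(outname, "wb") as out:
        for i, part in enumerate(parts):
            with open(part, "rb") as src:
                if i == len(parts) - 1:
                    shutil.copyfileobj(src, out)  # keep the final marker
                else:
                    remaining = members_end(part)
                    while remaining > 0:
                        chunk = src.read(min(remaining, 1 << 20))
                        if not chunk:
                            break
                        out.write(chunk)
                        remaining -= len(chunk)

concat_tars(["base-1.tar", "base-2.tar", "base-4.tar"], "base.tar")

That is still one extra sequential pass over the whole backup at the
end, of course, which may be close to the slowdown you are trying to
avoid.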

I think it also depends to some extent on what we decide in the nearby
thread [1] related to support for compression/encryption. Say, if we
want to support a new compression algorithm on the client side, then
we need to process the contents of each tar file anyway, in which case
combining them into a single tar file might be okay, but I am not sure
what the right thing is here. I think this part needs some more thought.
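
To illustrate the case where combining seems fine to me: if the client
is applying its own compression, it has to stream every member anyway,
so re-emitting everything as one combined archive would not add much
extra work. A rough sketch of what I mean (Python's tarfile is used
purely for illustration, and the file names are invented):

# Sketch only: if each member is already being read on the client (for
# client-side compression), re-emitting everything as a single
# compressed archive is straightforward.
import tarfile

def merge_and_compress(parts, outname):
    with tarfile.open(outname, "w:gz") as out:
        for part in parts:  # e.g. base-1.tar, base-2.tar, base-4.tar
            with tarfile.open(part, "r") as src:
                for member in src:
                    # extractfile() returns None for directories and
                    # links; addfile() only reads data for regular members.
                    out.addfile(member, src.extractfile(member))

merge_and_compress(["base-1.tar", "base-2.tar", "base-4.tar"], "base.tar.gz")

But when no client-side processing is requested at all, such a pass
would exist only to merge the files, so it is not obvious it would be
worth it.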

[1] - https://www.postgresql.org/message-id/CA%2BTgmoYr7%2B-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF%3DkUqkG%3DFw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
