Re: WIP/PoC for parallel backup

From: Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2019-08-23 17:50:09
Message-ID: CALtqXTcgLWm9+HSJNY5-nhjKBLzHjEzthcJ9csUVrisWS6VmbQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 23, 2019 at 10:26 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Greetings,
>
> * Asif Rehman (asifr(dot)rehman(at)gmail(dot)com) wrote:
> > On Fri, Aug 23, 2019 at 3:18 PM Asim R P <apraveen(at)pivotal(dot)io> wrote:
> > > Interesting proposal. Bulk of the work in a backup is transferring
> > > files from source data directory to destination. Your patch is
> > > breaking this task down into multiple sets of files and transferring
> > > each set in parallel. This seems correct; however, your patch is also
> > > creating a new process to handle each set. Is that necessary? I think
> > > we should try to achieve this using multiple asynchronous libpq
> > > connections from a single basebackup process. That is, to use the
> > > PQconnectStartParams() interface instead of PQconnectdbParams(), which
> > > is currently used by basebackup. On the server side, it may still
> > > result in multiple backend processes per connection, and an attempt
> > > should be made to avoid that as well, but it seems complicated.
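
For concreteness, here is a minimal sketch of the asynchronous-connection
flow being suggested, using libpq's PQconnectStartParams()/PQconnectPoll()
interface. The connection parameters and worker count are illustrative
only, and a real client would wait on PQsocket() with select()/poll()
rather than busy-loop:

/*
 * Sketch: start NWORKERS backend connections without blocking, as an
 * alternative to one blocking PQconnectdbParams() call per worker.
 * The conninfo values are hypothetical; a basebackup client would use
 * a replication connection instead.
 */
#include <stdio.h>
#include <libpq-fe.h>

#define NWORKERS 4

int
main(void)
{
    const char *keywords[] = {"dbname", NULL};
    const char *values[] = {"postgres", NULL};  /* hypothetical target */
    PGconn     *conns[NWORKERS];
    int         done[NWORKERS] = {0};
    int         pending = NWORKERS;

    /* Kick off every connection attempt; none of these calls block. */
    for (int i = 0; i < NWORKERS; i++)
    {
        conns[i] = PQconnectStartParams(keywords, values, 0);
        if (conns[i] == NULL)
        {
            done[i] = 1;
            pending--;
        }
    }

    /* Drive all attempts forward until each one succeeds or fails. */
    while (pending > 0)
    {
        for (int i = 0; i < NWORKERS; i++)
        {
            if (done[i])
                continue;
            switch (PQconnectPoll(conns[i]))
            {
                case PGRES_POLLING_OK:
                    done[i] = 1;
                    pending--;
                    break;
                case PGRES_POLLING_FAILED:
                    fprintf(stderr, "worker %d: %s", i,
                            PQerrorMessage(conns[i]));
                    done[i] = 1;
                    pending--;
                    break;
                default:
                    break;      /* still connecting; poll again */
            }
        }
    }

    /* ... fetch a distinct set of files over each connection ... */

    for (int i = 0; i < NWORKERS; i++)
        if (conns[i])
            PQfinish(conns[i]);
    return 0;
}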
> >
> > Thanks Asim for the feedback. This is a good suggestion. The main idea
> > I wanted to discuss is the design where we can open multiple backend
> > connections to get the data instead of a single connection. On the
> > client side we can have multiple approaches: one is to use asynchronous
> > APIs (as suggested by you) and the other could be to decide between
> > multi-process and multi-thread. The main point was that we can extract
> > a lot of performance benefit by using multiple connections, and I built
> > this POC to float the idea of how parallel backup can work, since the
> > core logic of getting the files using multiple connections will remain
> > the same whether we use asynchronous, multi-process or multi-threaded.
> >
> > I am going to address the division of files to be distributed evenly
> > among multiple workers based on file sizes; that would give us some
> > concrete numbers and also allow us to gauge the benefits of the async
> > versus multiprocess/thread approach on the client side.
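
One simple way to do that division is a greedy "largest file first"
assignment, sketched below, which keeps the total bytes per worker roughly
balanced. The BackupFile structure and assign_files() helper are
hypothetical names for illustration, not part of the patch:

/*
 * Sketch: greedy "largest file first" assignment of files to workers so
 * that the total bytes per worker stay roughly even.  A real client
 * would build the file list from the server's file listing.
 */
#include <stdlib.h>

typedef struct BackupFile
{
    const char *path;
    long        size;
} BackupFile;

static int
by_size_desc(const void *a, const void *b)
{
    long        sa = ((const BackupFile *) a)->size;
    long        sb = ((const BackupFile *) b)->size;

    return (sa < sb) - (sa > sb);   /* descending order */
}

/* owner[i] receives the worker index assigned to files[i] after sorting. */
static void
assign_files(BackupFile *files, int nfiles, int nworkers, int *owner)
{
    long       *load = calloc(nworkers, sizeof(long));

    qsort(files, nfiles, sizeof(BackupFile), by_size_desc);
    for (int i = 0; i < nfiles; i++)
    {
        int         best = 0;

        /* Hand the next-largest file to the least-loaded worker. */
        for (int w = 1; w < nworkers; w++)
            if (load[w] < load[best])
                best = w;
        owner[i] = best;
        load[best] += files[i].size;
    }
    free(load);
}

Sorting in descending order first bounds how uneven the final loads can
get, since the small files fill in the gaps left by the large ones.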
>
> I would expect you to quickly want to support compression on the server
> side, before the data is sent across the network, and possibly
> encryption, and so it'd likely make sense to just have independent
> processes and connections through which to do that.

+1 for compression and encryption, but I think parallelism will give us
the benefit with and without the compression.

> Thanks,
>
> Stephen
>

--
Ibrar Ahmed
