Re: WIP/PoC for parallel backup

From: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>
To: Asim R P <apraveen(at)pivotal(dot)io>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2019-08-23 16:04:07
Message-ID: CADM=JegF1XiOojZohv8Q4rr4OaUZxMo67PDE72dz-pUYTWpZ4w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 23, 2019 at 3:18 PM Asim R P <apraveen(at)pivotal(dot)io> wrote:

> Hi Asif
>
> Interesting proposal. The bulk of the work in a backup is transferring files
> from the source data directory to the destination. Your patch breaks this
> task down into multiple sets of files and transfers each set in parallel.
> This seems correct; however, your patch also creates a new process to
> handle each set. Is that necessary? I think we should try to achieve this
> using multiple asynchronous libpq connections from a single basebackup
> process. That is, to use the PQconnectStartParams() interface instead of
> PQconnectdbParams(), which basebackup currently uses. On the server
> side, it may still result in multiple backend processes, one per connection,
> and an attempt should be made to avoid that as well, but it seems complicated.
>
> What do you think?
>
> Asim
>

Thanks, Asim, for the feedback. This is a good suggestion. The main idea I
wanted to discuss is a design where we open multiple backend connections
to fetch the data instead of a single connection.
On the client side we can take multiple approaches: one is to use
asynchronous APIs (as you suggested), and another is to choose between a
multi-process and a multi-threaded model. The main point is that we can
extract a lot of performance benefit by using multiple connections, and I
built this POC to float the idea of how parallel backup can work, since the
core logic of fetching the files over multiple connections remains the same
whether we use asynchronous connections, multiple processes, or multiple
threads.
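
To make the asynchronous option concrete, here is a rough sketch (untested,
error handling elided) of how a single basebackup process could establish
several connections without blocking, using PQconnectStartParams() and
PQconnectPoll(). NUM_WORKERS and the conninfo keywords are placeholders for
illustration, not anything from the patch:

#include <stdbool.h>
#include <sys/select.h>
#include "libpq-fe.h"

#define NUM_WORKERS 4            /* hypothetical worker count */

static void
open_backup_connections(PGconn *conns[NUM_WORKERS])
{
    /* Placeholder conninfo; a real patch would pass the user's options. */
    const char *keywords[] = {"dbname", NULL};
    const char *values[]   = {"replication", NULL};
    bool        done[NUM_WORKERS] = {false};
    int         pending = NUM_WORKERS;

    /* Start all connection attempts in non-blocking mode. */
    for (int i = 0; i < NUM_WORKERS; i++)
        conns[i] = PQconnectStartParams(keywords, values, 0);

    /* Drive every connection forward until each succeeds or fails. */
    while (pending > 0)
    {
        fd_set  rfds, wfds;
        int     maxfd = -1;

        FD_ZERO(&rfds);
        FD_ZERO(&wfds);

        for (int i = 0; i < NUM_WORKERS; i++)
        {
            if (done[i])
                continue;

            switch (PQconnectPoll(conns[i]))
            {
                case PGRES_POLLING_READING:
                    FD_SET(PQsocket(conns[i]), &rfds);
                    break;
                case PGRES_POLLING_WRITING:
                    FD_SET(PQsocket(conns[i]), &wfds);
                    break;
                default:        /* PGRES_POLLING_OK or _FAILED */
                    done[i] = true;
                    pending--;
                    continue;
            }
            if (PQsocket(conns[i]) > maxfd)
                maxfd = PQsocket(conns[i]);
        }

        /* Sleep until at least one pending socket is ready. */
        if (pending > 0)
            select(maxfd + 1, &rfds, &wfds, NULL, NULL);
    }
}

Once the connections are up, the same select() loop could multiplex the data
streams coming back on each connection, all within one client process.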

Next, I am going to address dividing the files evenly among multiple workers
based on file sizes. That will let us get some concrete numbers, and it will
also help us gauge the relative benefits of the async and
multi-process/multi-threaded approaches on the client side.
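
For the division itself, one simple option is a greedy heuristic: sort the
files by size in descending order, then always hand the next file to the
worker with the smallest total assigned so far. A rough sketch of that idea
(the BackupFile struct and NUM_WORKERS are hypothetical names, untested):

#include <stdlib.h>

#define NUM_WORKERS 4

typedef struct BackupFile
{
    const char *path;
    long        size;
    int         worker;         /* assigned worker index */
} BackupFile;

/* qsort comparator: largest files first. */
static int
size_desc(const void *a, const void *b)
{
    long sa = ((const BackupFile *) a)->size;
    long sb = ((const BackupFile *) b)->size;

    return (sa < sb) - (sa > sb);
}

static void
assign_files(BackupFile *files, int nfiles)
{
    long totals[NUM_WORKERS] = {0};

    qsort(files, nfiles, sizeof(BackupFile), size_desc);

    for (int i = 0; i < nfiles; i++)
    {
        int best = 0;

        /* Pick the worker with the least data assigned so far. */
        for (int w = 1; w < NUM_WORKERS; w++)
            if (totals[w] < totals[best])
                best = w;

        files[i].worker = best;
        totals[best] += files[i].size;
    }
}

This keeps the per-worker byte counts close to even even when a few very
large relation files dominate the data directory.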

Regards,
Asif
