Re: design for parallel backup

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-04-21 04:50:01
Message-ID: CAA4eK1LQuN6LCaiHduaF0L47DUQ2e5cL_QNMytPRVUNidOA0dw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 21, 2020 at 2:40 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
>
> > If a backup client - either current or hypothetical - is compressing
> > and encrypting, then it doesn't have either a filesystem I/O or a
> > network I/O in progress while it's doing so. You take not only the hit
> > of the time required for compression and/or encryption, but also use
> > that much less of the available network and/or I/O capacity.
>
> I don't think it's really the time for network/file I/O that's the
> issue. Sure memcpy()'ing from the kernel takes time, but compared to
> encryption/compression it's not that much. Especially for compression,
> it's not really lack of cycles for networking that prevent a higher
> throughput, it's that after buffering a few MB there's just no point
> buffering more, given compression will plod along with 20-100MB/s.
>

It is quite likely that compression benefits more from parallelism than
network I/O does, since compression is mostly a CPU-intensive operation,
but I am not sure we can simply ignore the benefit of utilizing the
network bandwidth. In our case, after copying data from the network we
write it to disk, so during that filesystem I/O the network can still be
used if some other parallel worker is processing other parts of the data.
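
To make that overlap concrete, here is a minimal, hypothetical sketch
(plain C with pthreads, not PostgreSQL code): each worker alternates
between fetching a chunk over the network and writing it to disk, so
while one worker is blocked on the filesystem another can keep the
network busy. The worker count, chunk count, and the sleep-based
stand-ins for real I/O are illustrative assumptions only.

/*
 * Sketch only: two workers, each alternating between a simulated
 * "network fetch" and a "disk write".  While one worker is writing
 * its chunk to disk, the other can be pulling the next chunk off the
 * network, so neither resource sits idle.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_WORKERS 2
#define CHUNKS_PER_WORKER 4

static void *
backup_worker(void *arg)
{
	int		id = (int) (long) arg;

	for (int chunk = 0; chunk < CHUNKS_PER_WORKER; chunk++)
	{
		/* Stand-in for receiving one chunk of backup data. */
		printf("worker %d: fetching chunk %d from network\n", id, chunk);
		usleep(100 * 1000);

		/* Stand-in for flushing that chunk to the local filesystem. */
		printf("worker %d: writing chunk %d to disk\n", id, chunk);
		usleep(100 * 1000);
	}
	return NULL;
}

int
main(void)
{
	pthread_t	workers[NUM_WORKERS];

	for (long i = 0; i < NUM_WORKERS; i++)
		pthread_create(&workers[i], NULL, backup_worker, (void *) i);
	for (int i = 0; i < NUM_WORKERS; i++)
		pthread_join(workers[i], NULL);
	return 0;
}

Built with "gcc -pthread", the interleaved output shows one worker's
network fetches overlapping the other's disk writes, which is the kind
of overlap I mean above.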

Also, there may be some users who don't want their data to be
compressed, for example because the overhead of decompression makes
restore take longer and they are not comfortable with that: for them a
faster restore is much more critical than a compressed or faster
backup. For such cases, the parallelism during backup being discussed
in this thread will still be helpful.

OTOH, I think without some measurements it is difficult to say that we
get a significant benefit from parallelizing the backup without
compression. I have scanned the other thread [1] where the patch for
parallel backup was discussed and didn't find any performance numbers,
so having some performance data with that patch might give us a better
understanding of introducing parallelism in the backup.

[1] - https://www.postgresql.org/message-id/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
