Re: design for parallel backup

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-05-04 18:04:32
Message-ID: CA+TgmoZNUsr_Bpjv9T5D8-UW1Rnh3opDbD+PjX33SXR-NCJ2sg@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 3, 2020 at 1:49 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > The run-to-run variations between the runs without cache control are
> > > pretty large. So this is probably not the end-all-be-all numbers. But I
> > > think the trends are pretty clear.
> >
> > Could you be explicit about what you think those clear trends are?
>
> Largely that concurrency can help a bit, but also hurt
> tremendously. Below is some more detailed analysis, it'll be a bit
> long...

OK, thanks. Let me see if I can summarize here. On the strength of
previous experience, you'll probably tell me that some parts of this
summary are wildly wrong or at least "not quite correct" but I'm going
to try my best.

- Server-side compression seems like it has the potential to be a
significant win by stretching bandwidth. We likely need to do it with
10+ parallel threads, at least for stronger compressors, but these
might be threads within a single PostgreSQL process rather than
multiple separate backends.

- Client-side cache management -- that is, use of
posix_fadvise(DONTNEED), posix_fallocate, and sync_file_range, where
available -- looks like it can improve write rates and CPU efficiency
significantly. Larger block sizes show a win when used together with
such techniques.

- The benefits of multiple concurrent connections remain somewhat
elusive. Peter Eisentraut hypothesized upthread that such an approach
might be the most practical way forward for networks with a high
bandwidth-delay product, and I hypothesized that such an approach
might be beneficial when there are multiple tablespaces on independent
disks, but we don't have clear experimental support for those
propositions. Also, both your data and mine indicate that too much
parallelism can lead to major regressions.

- Any work we do while trying to make backup super-fast should also
lend itself to super-fast restore, possibly including parallel
restore. Compressed tarfiles don't permit random access to member
files. Uncompressed tarfiles do, but software that works this way is
not commonplace. The only mainstream archive format that seems to
support random access is zip. Adopting that wouldn't be
crazy, but might limit our choice of compression options more than
we'd like. A tar file of individually compressed files might be a
plausible alternative, though there would probably be some hit to
compression ratios for small files. Then again, if a single,
highly-efficient process can handle a server-to-client backup, maybe
the same is true for extracting a compressed tarfile...

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
