Re: design for parallel backup

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-04-20 20:02:32
Message-ID: 892b057f-bc33-6bb3-0abf-8bd5674ba901@2ndquadrant.com
Lists: pgsql-hackers

On 2020-04-15 17:57, Robert Haas wrote:
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.

That would clearly be a good goal. Non-parallel backup should ideally
be parallel backup with one worker.

But it doesn't follow that the proposed design is wrong. It might just
be that the design of the existing backup should change.

I think making the wire format so heavily tied to the tar format is
dubious. There is nothing particularly fabulous about the tar format.
If the server just sends a bunch of files with metadata for each file,
the client can assemble them in any way it wants: unpacked, packed into
several tarballs like now, packed all into one tarball, packed into a
zip file, sent to S3, etc.
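
To make that concrete, here is a minimal sketch of what "a bunch of
files with metadata" could look like on the wire, and how the client
could then repack the same stream however it likes. The message layout
is invented for illustration; it is not the actual replication
protocol:

import io, struct, tarfile

def send_file(out, path, data, mode=0o600):
    # One file message: 2-byte path length, path, then mode, size, payload.
    p = path.encode()
    out.write(struct.pack("!H", len(p)) + p)
    out.write(struct.pack("!II", mode, len(data)))
    out.write(data)

def recv_files(inp):
    # Yield (path, mode, data) until the stream ends.
    while (hdr := inp.read(2)):
        (plen,) = struct.unpack("!H", hdr)
        path = inp.read(plen).decode()
        mode, size = struct.unpack("!II", inp.read(8))
        yield path, mode, inp.read(size)

buf = io.BytesIO()
send_file(buf, "base/1/1259", b"...relation data...")
buf.seek(0)

# The client chooses the layout; here it repacks everything into one tarball,
# but writing unpacked files or streaming to object storage works the same way.
with tarfile.open("backup.tar", "w") as tar:
    for path, mode, data in recv_files(buf):
        info = tarfile.TarInfo(path)
        info.size, info.mode = len(data), mode
        tar.addfile(info, io.BytesIO(data))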

Another thing I would like to see sometime is this: pull a minimal
basebackup, then start recovery and possibly hot standby before you have
received all the files. When you need to access a file that's not there
yet, request it from the server as a priority. If you nudge the file
order a little, perhaps with prewarm-like data, you could get a mostly
functional standby without having to wait for the full basebackup to
finish. Being able to pull individual files on request is a
prerequisite for this.

> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.

I understand the motivation for this: Why not compress or encrypt the
backup on the server side already? That makes sense. But this way of
doing it seems weird and complicated. If I want a backup, I want one
file, not an unpredictable set of files. How do I even know I have them
all? Do we need a meta-manifest?
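
For illustration, such a meta-manifest could be as simple as a list of
the archives the server decided to emit, which the client checks off at
the end. The file name and JSON layout here are made up, not an
existing format:

import json, pathlib

def verify_backup(backup_dir):
    d = pathlib.Path(backup_dir)
    manifest = json.loads((d / "backup_manifest.json").read_text())
    expected = set(manifest["archives"])   # e.g. {"base-1.tar", "base-2.tar"}
    present = {p.name for p in d.glob("*.tar")}
    missing = expected - present
    if missing:
        raise RuntimeError("incomplete backup, missing: %s" % sorted(missing))
    return True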

A format such as ZIP would offer more flexibility, I think. You can
build a single target file incrementally, and you can compress or
encrypt each member file separately, which allows doing some of the
compression etc. on the server. I'm not saying it's perfect for this,
but some more thought about the archive format would open up some
possibilities.
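
As a small demonstration of that flexibility, Python's standard
zipfile module can already append members to a single archive one at a
time, each with its own compression method:

import zipfile

with zipfile.ZipFile("backup.zip", "w") as zf:
    # Incompressible or pre-compressed data: store this member as-is.
    zf.writestr("pg_wal/000000010000000000000001", b"...",
                compress_type=zipfile.ZIP_STORED)
    # Compressible relation data: deflate just this member.
    zf.writestr("base/1/1259", b"...heap pages...",
                compress_type=zipfile.ZIP_DEFLATED)

Each member records its own compression method in the central
directory, so a reader can extract any one member without touching the
rest.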

All things considered, we'll probably want more options and more ways of
doing things.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
