Re: refactoring basebackup.c

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>
Subject: Re: refactoring basebackup.c
Date: 2021-07-08 15:56:10
Message-ID: CA+TgmoYgVN=-Yoh71r3P9N7eKysd7_9b9s+1QFfFcs3w7Z-tig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 21, 2020 at 12:14 PM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> v2-0001 through v2-0009 still apply cleanly, but v2-0010 no longer applies. It seems to be conflicting with Heikki's work from August. Could you rebase please?

Here at last is a new version. I've dropped the "bbarchiver" patch for
now, added a new patch that I'll talk about below, and revised the
others. I'm pretty happy with the code now, so I guess the main things
that I'd like feedback on are (1) whether design changes seem to be
needed and (2) the UI. Once we have that stuff hammered out, I'll work
on adding documentation, which is missing at present. The interesting
patches in terms of functionality are 0006 and 0007; the rest is
preparatory refactoring.

0006 adds a concept of base backup "targets," which means that it lets
you send the base backup to someplace other than the client. You
specify the target using a new "-t" option to pg_basebackup. By way of
example, 0006 adds a "blackhole" target which throws the backup away
instead of sending it anywhere, and also a "server" target which
stores the backup to the server filesystem in lieu of streaming it to
the client. So you can say something like "pg_basebackup -Xnone -Ft -t
server:/backup/2021-07-08" and, provided that you're superuser, the
server will try to drop the backup there. At present, you can't use
-Fp or -Xfetch or -Xstream with a backup target, because that
functionality is implemented on the client side. I think that's an
acceptable restriction. Eventually I imagine we will want to have
targets like "aws" or "s3" or maybe some kind of plug-in system for
new targets. I haven't designed anything like that yet, but I think
it's probably not all that hard to generalize what I've got.

0007 adds server-side compression; currently, it only supports
server-side compression using gzip, but I hope that it won't be hard
to generalize that to support LZ4 as well, and Andres told me he
thinks we should aim to support zstd since that library has built-in
parallel compression which is very appealing in this context. So you
say something like "pg_basebackup -Ft --server-compression=gzip -D
/backup/2021-07-08" or, if you want that compressed backup stored on
the server and compressed as hard as possible, you could say
"pg_basebackup -Xnone -Ft --server-compression=gzip9 -t
server:/backup/2021-07-08". Unfortunately, here again there are a
number of features that are implemented on the client side, and they
don't work in combination with this. -Fp could be made to work by
teaching the client to decompress; I just haven't written the code to
do that. It's probably not very useful in general, but maybe there's a
use case if you're really tight on network bandwidth. Making -R work
looks outright useless, because the client would have to get the whole
compressed tarfile from the server and then uncompress it, edit the
tar file, and recompress. That seems like a thing no one can possibly
want. Also, if you say pg_basebackup -Ft -D- >whatever.tar, the server
injects the backup manifest into the tarfile, which if you used
--server-compression would require decompressing and recompressing the
whole thing, so it doesn't seem worth supporting. It's more likely to
be a footgun than to help anybody. This option can be used with
-Xstream or -Xfetch, but it doesn't compress pg_wal.tar, because
that's generated on the client side.

The thing I'm really unhappy with here is the -F option to
pg_basebackup, which presently allows only p for plain or t for tar.
For purposes of these patches, I've essentially treated this as if -Fp
means "I want the tar files the server sends to be extracted" and
"-Ft" as if it means "I'm happy with them the way they are." Under
that interpretation, it's fine for --server-compression to cause e.g.
base.tar.gz to be written, because that's what the server sent. But
it's not really a "tar" output format; it's a "tar.gz" output format.
However, it doesn't seem to make any sense to define -Fz to mean "i
want tar.gz output" because -Z or -z already produces tar.gz output
when used with -Ft, and also because it would be redundant to make
people specify both -Fz and --server-compression. Similarly, when you
use --target, the output format is arguably, well, nothing. I mean,
some tar files got stored to the target, but you don't have them, but
again it seems redundant to have people specify --target and then also
have to change the argument to -F. Hindsight being 20-20, I think we
would have been better off not having a -Ft or -Fp option at all, and
having an --extract option that says you want to extract what the
server sends you, but it's probably too late to make that change now.
Or maybe it isn't, and we should just break command-line argument
compatibility for v15. I don't know. Opinions appreciated, especially
if they are nuanced.

If you're curious about what the other patches in the series do,
here's a very fast recap; see commit messages for more. 0001 revises
the grammar for some replication commands to use an extensible-options
syntax. 0002 is a trivial refactoring of basebackup.c. 0003 and 0004
refactor the server's basebackup.c and the client's pg_basebackup.c,
respectively, by introducing abstractions called bbsink and
bbstreamer. 0005 introduces a new COPY sub-protocol for taking base
backups. I think it's worth mentioning that I believe that this
refactoring is quite powerful and could let us do a bunch of other
things that this patch set doesn't attempt. For instance, since this
makes it pretty easy to implement server-side compression, it could
probably also pretty easily be made to do server-side encryption, if
you're brave enough to want to have a discussion on pgsql-hackers
about how to design an encryption feature.

Thanks to my colleague Tushar Ahuja for helping test some of this code.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v3-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch application/octet-stream 24.3 KB
v3-0006-Support-base-backup-targets.patch application/octet-stream 32.3 KB
v3-0005-Modify-pg_basebackup-to-use-a-new-COPY-subprotoco.patch application/octet-stream 33.2 KB
v3-0007-WIP-Server-side-gzip-compression.patch application/octet-stream 17.8 KB
v3-0004-Introduce-bbstreamer-abstraction-to-modularize-pg.patch application/octet-stream 80.2 KB
v3-0002-Refactor-basebackup.c-s-_tarWriteDir-function.patch application/octet-stream 4.0 KB
v3-0003-Introduce-bbsink-abstraction-to-modularize-base-b.patch application/octet-stream 76.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari Mannsåker 2021-07-08 16:23:51 Re: Replace remaining castNode(…, lfirst(…)) and friends calls with l*_node()
Previous Message Tom Lane 2021-07-08 15:21:48 Re: pgsql: Don't try to print data type names in slot_store_error_callback(