From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: where should I stick that backup?
Date: 2020-04-15 22:13:46
Message-ID: 20200415221346.gibhz4wmvl6q7puw@alap3.anarazel.de

Hi,

On 2020-04-15 09:23:30 -0400, Robert Haas wrote:
> On Tue, Apr 14, 2020 at 9:50 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2020-04-14 11:38:03 -0400, Robert Haas wrote:
> > > I'm fairly deeply uncomfortable with what Andres is proposing. I see
> > > that it's very powerful, and can do a lot of things, and that if
> > > you're building something that does sophisticated things with storage,
> > > you probably want an API like that. It does a great job making
> > > complicated things possible. However, I feel that it does a lousy job
> > > making simple things simple.
> >
> > I think it's pretty much exactly the opposite. Your approach seems to
> > move all the complexity to the user, having to build entire combinations
> > of commands themselves. Instead of having one or two default commands
> > that do backups in common situations, everyone has to assemble them from
> > pieces.
>
> I think we're mostly talking about different things.

That certainly would explain some misunderstandings ;)

I mostly still am trying to define where we eventually want to be on a
medium to high level. And I don't think we really have agreement on
that. My original understanding of your eventual goal is that it's the
example invocation of pg_basebackup upthread, namely a bunch of shell
arguments to pg_basebackup. And I don't think that's good enough. IMO
it only really makes sense to design incremental steps after we have a rough
agreement on the eventual goal. Otherwise we'll just end up supporting
the outcomes of missteps for a long time.

> I was speaking mostly about the difficulty of developing it. I agree
> that a project which is easier to develop is likely to provide fewer
> benefits to the end user. On the other hand, it might be more likely
> to get done, and projects that don't get done provide few benefits to
> users. I strongly believe we need an incremental approach here.

I agree. My concern is just that we should not expose things to the
user that will make it much harder to evolve going forward.

> I'm not against adding more built-in compression algorithms, but I
> also believe (as I have said several times now) that the world moves a lot
> faster than PostgreSQL, which has not added a single new compression
> algorithm to pg_basebackup ever. We had 1 compression algorithm in
> 2011, and we still have that same 1 algorithm today. So, either nobody
> cares, or adding new algorithms is sufficiently challenging - for
> either technical or political reasons - that nobody's managed to get
> it done.

Imo most of the discussion has been around TOAST, and there the
situation is much more complicated than just adding a compression
algorithm. I don't recall any discussion about adding an optional
dependency on other compression algorithms to pg_basebackup that went
nowhere for either technical or political reasons.

> I think having a simple framework in pg_basebackup for plugging in new
> algorithms would make it noticeably simpler to add LZ4 or whatever
> your favorite compression algorithm is. And I think having that
> framework also be able to use shell commands, so that users don't have
> to wait a decade or more for new choices to show up, is also a good
> idea.

As long as there are sensible defaults, so that the user doesn't have
to specify paths to binaries for the common cases, I'm OK with that. I'm
not OK with requiring the user to specify shell fragments for things
that should be built in.

If we think the appropriate way to implement extensible compression is
by piping to commandline binaries ([1]), it'd imo e.g. be OK if we had a
builtin list of [{file-ending, shell-fragment-for-compression}] that is
filled with appropriate values detected at build time for a few common
cases, but that also allowed adding new methods via commandline options.
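
To make that concrete, here's a minimal sketch of what such a builtin
table could look like. All names (CompressionMethod,
lookup_compression_command, the example entries) are made up for
illustration; this is not an existing pg_basebackup API:

    #include <stdio.h>
    #include <string.h>

    typedef struct CompressionMethod
    {
        const char *file_ending;    /* suffix appended to archive files */
        const char *shell_fragment; /* command the data is piped through */
    } CompressionMethod;

    /*
     * In a real build this table would be populated with whatever was
     * detected as available at build time; entries here are illustrative.
     */
    static const CompressionMethod builtin_compression_methods[] = {
        {".gz", "gzip -c"},
        {".lz4", "lz4 -c"},
        {".zst", "zstd -c"},
        {NULL, NULL}
    };

    static const char *
    lookup_compression_command(const char *file_ending)
    {
        const CompressionMethod *m;

        for (m = builtin_compression_methods; m->file_ending != NULL; m++)
        {
            if (strcmp(m->file_ending, file_ending) == 0)
                return m->shell_fragment;
        }
        return NULL;        /* unknown: fall back to user-supplied method */
    }

    int
    main(void)
    {
        printf("%s\n", lookup_compression_command(".gz"));  /* gzip -c */
        return 0;
    }

A commandline option would then just append entries to that list at
runtime, without the common cases ever requiring one.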

I guess what I perceived, before this email, to be the fundamental
difference between our positions is that I (still) think that exposing
detailed postprocessing shell-fragment style arguments to pg_basebackup,
especially as the only option to use the new capabilities, will nail us
into a corner - but you don't necessarily think so? Whereas I had/have no
problem with implementing features by *internally* piping through
external binaries, as long as the user doesn't always have to specify
them.

[1] I am not sure either way that piping is a great idea medium
term. One concern is that IIRC Windows pipe performance is not great,
and that there are some other portability problems as well. I think
there are also valid concerns about per-file overhead, which might be a
problem for some future uses.

> > I really really don't understand this. Are you suggesting that for
> > server side compression etc we're going to add the ability to specify
> > shell commands as argument to the base backup command? That seems so
> > obviously a non-starter? A good default for backup configurations
> > should be that the PG user that the backup is done under is only allowed
> > to do that, and not that it directly has arbitrary remote command
> > execution.
>
> I hadn't really considered that aspect, and that's certainly a
> concern. But I also don't understand why you think it's somehow a big
> deal. My point is not that clients should have the ability to execute
> arbitrary commands on the server. It's that shelling out to an
> external binary provided by the operating system is a reasonable thing
> to do, versus having everything have to be done by binaries that we
> create. Which I think is what you are also saying right here:

> > But the tool speaking the protocol can just allow piping through
> > whatever tool? Given that there likely are benefits to either doing
> > things on the client side or on the server side, it seems inevitable
> > that there's multiple places that would make sense to have the
> > capability for?
>
> Unless I am misunderstanding you, this is exactly what I was
> proposing, and have been proposing since the first email on the
> thread.

Well, no and yes. As I said above, for me there's a difference between
piping to commands as an internal implementation detail, and that being
the non-poweruser interface. It may or may not be the right tradeoff to
implement server side compression by piping the output to/from some
binary. IMO specifying shell fragments as arguments to BASE_BACKUP is
clearly not the right way to implement server side compression.
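
For contrast, here's a standalone sketch of the "internal implementation
detail" variant - plain popen() streaming data through a compressor.
None of this is actual pg_basebackup code, and "gzip -c" just stands in
for whatever default got detected at build time:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /* the shell fragment would come from a builtin table, not the user */
        FILE       *pipe = popen("gzip -c > base.tar.gz", "w");
        char        buf[8192];
        size_t      n;

        if (pipe == NULL)
        {
            perror("popen");
            return EXIT_FAILURE;
        }

        /* stream the backup data (here: stdin) through the compressor */
        while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
        {
            if (fwrite(buf, 1, n, pipe) != n)
            {
                perror("fwrite");
                pclose(pipe);
                return EXIT_FAILURE;
            }
        }

        if (pclose(pipe) != 0)
        {
            fprintf(stderr, "compressor exited with failure\n");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }

The user-visible knob would stay something like --compress=<method>;
whether that's implemented via a pipe, a thread, or a linked-in library
remains an internal decision we can revisit later.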

Nor do I think it's the right thing, albeit a tad more debatable, that
for decent client side compression one has to specify a binary whose
path will differ on various platforms (on Windows you can't rely on
PATH).

If we were to go for building all this via pipes, utilizing that to
make compression etc. extensible for power users makes sense to me.

But I don't think it makes sense to design a C API without a rough
picture of how things should eventually look. If we were, e.g.,
eventually going to do all the work of compressing and transferring data
in one external binary, then a C API exposing transformations in
pg_basebackup doesn't necessarily make sense. If it turns out that
pipes are too inefficient on Windows to implement compression filters,
that we need parallel awareness in the API, etc., that will influence
the design.

> > > It's possibly not the exact same thing. A special tool might, for example,
> > > use multiple threads for parallel compression rather than multiple
> > > processes, perhaps gaining a bit of efficiency. But it's doubtful
> > > whether all users care about such marginal improvements.
> >
> > Marginal improvements? Compression scales decently well with the number
> > of cores. pg_basebackup's compression is useless because it's so slow
> > (and because it's client-side, but that's IME the lesser issue). I feel I
> > must be misunderstanding what you mean here.
> >
> > gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The
> > latter will still sometimes be a bottleneck (it's a bottleneck in pigz,
> > not available compression cycles), but a lot less commonly than at 180MB/s.
>
> That's really, really, really not what I was talking about.

What did you mean by the "marginal improvements" paragraph above?

Greetings,

Andres Freund
