Re: where should I stick that backup?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: where should I stick that backup?
Date: 2020-04-15 23:55:34
Message-ID: CA+TgmoZyU0tDAG30SzNwpGkhtXYsjenAoYt6ubT=3d3matUMGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 15, 2020 at 6:13 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I guess what I perceived to be the fundamental difference, before this
> email, between our positions is that I (still) think that exposing
> detailed postprocessing shell fragment style arguments to pg_basebackup,
> especially as the only option to use the new capabilities, will nail us
> into a corner - but you don't necessarily think so? Where I had/have no
> problems with implementing features by *internally* piping through
> external binaries, as long as the user doesn't have to always specify
> them.

My principle concern is actually around having a C API and a flexible
command-line interface. If we rearrange the code and the pg_basebackup
command line syntax so that it's easy to add new "filters" and
"targets", then I think that's a very good step forward. It's of less
concern to me whether those "filters" and "targets" are (1) C code
that we ship as part of pg_basebackup, (2) C code by extension authors
that we dynamically load into pg_basebackup, (3) off-the-shelf
external programs that we invoke, or (4) special external programs
that we provide which do special magic. However, of those options, I
like #4 least, because it seems like a pain in the tail to implement.
It may turn out to be the most powerful and flexible, though I'm not
completely sure about that yet.

As to exactly how far we can get with #3, I think it depends a good
deal on the answer to this question you pose in a footnote:

> [1] I am not sure, nor the opposite, that piping is a great idea medium
> term. One concern is that IIRC windows pipe performance is not great,
> and that there's some other portability problems as well. I think
> there's also valid concerns about per-file overhead, which might be a
> problem for some future uses.

If piping stuff through shell commands performs well for use cases
like compression, then I think we can get pretty far with piping
things through shell commands. It means we can use any compression at
all with no build-time dependency on that compressor. People can
install anything they want, stick it in $PATH, and away they go. I see
no particular reason to dislike that kind of thing; in fact, I think
it offers many compelling advantages. On the other hand, if we really
need to interact directly with the library to get decent performance,
because, say, pipes are too slow, then the approach of piping things
through an arbitrary shell commands is a lot less exciting.

Even then, though, I wonder how many runtime dependencies we're
seriously willing to add. I imagine we can add one or two more
compression algorithms without giving everybody fits, even if it means
adding optional build-time and run-time dependencies on some external
libraries. Any more than that is likely to provoke a backlash. And I
doubt whether we're willing to have the postgresql operating system
package depend on something like libgcrypt at all. I would expect such
a proposal to meet with vigorous objections. But without such a
dependency, how would we realistically get encrypted backups except by
piping through a shell command? I don't really see a way, and letting
a user specify a shell fragment to define what happens there seems
pretty reasonable to me. I'm also not very sure to what we can assume,
with either compression or encryption, that one size fits all. If
there are six popular compression libraries and four popular
encryption libraries, does anyone really believe that it's going to be
OK for 'yum install postgresql-server' to suck in all of those things?
Or, even if that were OK or if it we could somehow avoid it, what are
the chances that we'd actually go to the trouble of building
interfaces to all of those things? I'd rate them as slim to none; we
suck at that sort of thing. Exhibit A: The work to make PostgreSQL
support more than one SSL library.

I'm becoming fairly uncertain as to how far we can get with shell
commands; some of the concerns raised about, for example, connection
management when talking to stuff like S3 are very worrying. At the
same time, I think we need to think pretty seriously about some of the
upsides of shell commands. The average user cannot write a C library
that implements an API. The average user cannot write a C binary that
speaks a novel, PostgreSQL-specific protocol. Even the above-average
user who is capable of doing those things probably won't have the time
to actually do it. So if thing you have to do to make PostgreSQL talk
to the new sljgsjl compressor is either of those things, then we will
not have sljgsjl compression support for probably a decade after it
becomes the gold standard that everyone else in the industry is using.
If what you have to do is 'yum install sljgsjl' and then pg_basebackup
--client-filter='shell sljgsjl', people can start using it as soon as
their favorite distro packages it, without anyone who reads this
mailing list needing to do any work whatsoever. If what you have to
do is create a 'sljgsjl.json' file in some PostgreSQL install
directory that describes the salient properties of this compressor,
and then after that you can say pg_basebackup --client-filter=sljgsjl,
that's also accessible to a broad swath of users. Now, it may be that
there's no practical way to make things that easy. But, to the extent
that we can, I think we should. The ability to integrate new
technology without action by PostgreSQL core developers is not the
only consideration here, but it's definitely a good thing to have
insofar as we reasonably can.

> But I don't think it makes sense to design a C API without a rough
> picture of how things should eventually look like. If we were, e.g.,
> eventually going to do all the work of compressing and transferring data
> in one external binary, then a C API exposing transformations in
> pg_basebackup doesn't necessarily make sense. If it turns out that
> pipes are too inefficient on windows to implement compression filters,
> that we need parallel awareness in the API, etc it'll influence the API.

Yeah. I think we really need to understand the performance
characteristics of pipes better. If they're slow, then anything that
needs to be fast has to work some other way (but we could still
provide a pipe-based slow way for niche uses).

> > That's really, really, really not what I was talking about.
>
> What did you mean with the "marginal improvements" paragraph above?

I was talking about running one compressor processor with multiple
compression threads each reading from a separate pipe, vs. running
multiple processes each with a single thread doing the same thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-04-15 23:57:45 Re: xid wraparound danger due to INDEX_CLEANUP false
Previous Message Andres Freund 2020-04-15 23:44:08 Re: Include sequence relation support in logical replication