Re: where should I stick that backup?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: where should I stick that backup?
Date: 2020-04-06 21:27:24
Message-ID: CA+TgmobtGo+vdUkhBZ8Ocdpfqbib5KLf7Ov7vnqdJAB3X+AN9Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 6, 2020 at 1:32 PM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> Now, if we were just talking about compression, it would actually be
> interesting to implement some sort of "postgres compression API" if
> you will, that is implemented by a shared library. This library could
> then be used from pg_basebackup or from anything else that needs
> compression. And anybody who wants could then do a "<compression X>
> for PostgreSQL" module, removing the need for us to carry such code
> upstream.

I think it could be more general than a compression library. It could
be a store-my-stuff-and-give-it-back-to-me library, which might do
compression or encryption or cloud storage or any combination of the
three, and probably other stuff too. Imagine that you first call an
init function with a namespace that is basically a string provided by
the user. Then you open a file either for read or for write (but not
both). Then you read or write a series of chunks (depending on the
file mode). Then you close the file. Then you can do the same with
more files. Finally at the end you close the namespace. You don't
really need to care where or how the functions you are calling store
the data. You just need them to return proper error indicators if by
chance they fail.
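
To make that concrete, here is roughly the shape of interface I have
in mind. All of these names are invented for illustration; nothing
like this exists today:

    #include <sys/types.h>      /* ssize_t */
    #include <stdbool.h>        /* bool */

    typedef struct StorageNamespace StorageNamespace;
    typedef struct StorageFile StorageFile;

    /* "nsname" is the string provided by the user */
    extern StorageNamespace *storage_init(const char *nsname);

    /* mode is 'r' or 'w'; a file is never open for both at once */
    extern StorageFile *storage_open(StorageNamespace *ns,
                                     const char *filename, char mode);

    /* read or write a chunk, per the file's mode; returns -1 on error */
    extern ssize_t storage_read(StorageFile *file, void *buf, size_t len);
    extern ssize_t storage_write(StorageFile *file, const void *buf,
                                 size_t len);

    /* each returns false on failure */
    extern bool storage_close_file(StorageFile *file);
    extern bool storage_close_namespace(StorageNamespace *ns);

The caller never learns whether the bytes went to a local bzip2 file,
an encrypted blob, or an S3 bucket; it just checks the return values.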

As compared with my previous proposal, this would work much better for
pg_basebackup -Fp, because you wouldn't launch a new bzip2 process for
every file. You'd just call bzopen(), which is presumably quite
lightweight by comparison (see the bzlib sketch after the list below).
The reasons I didn't propose it are:

1. Running bzip2 on every file in a plain-format backup seems a lot
sillier than running it on every tar file in a tar-format backup.
2. I'm not confident that the command specified here actually needs to
be anything very complicated (unlike archive_command).
3. The barrier to entry for a loadable module is a lot higher than for
a shell command.
4. I think that all of our existing infrastructure for loadable
modules is backend-only.
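
To illustrate the bzopen() point above, here is a minimal sketch of
per-file compression done in-process with bzlib's high-level interface
(the modern names carry a BZ2_ prefix; error reporting abbreviated):

    #include <bzlib.h>
    #include <stdbool.h>

    /* Compress one buffer into one .bz2 file, without forking anything. */
    static bool
    compress_one_file(const char *path, void *buf, int len)
    {
        BZFILE     *bz = BZ2_bzopen(path, "w");

        if (bz == NULL)
            return false;
        if (BZ2_bzwrite(bz, buf, len) != len)
        {
            BZ2_bzclose(bz);
            return false;
        }
        BZ2_bzclose(bz);
        return true;
    }

That's one library call to open and one to close per file, versus a
fork()/exec() per file if we shell out.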

Now, all of these are up for discussion. I am sure we can make the
loadable module stuff work in frontend code; it would just take some
work. A C interface for extensibility is significantly harder to use
than a shell interface, but it's still way better than no interface.
My current belief is that the shell command can be something simple,
but that may turn out to be wrong. And I'm sure somebody can propose a
good reason to do something with every file in a plain-format backup
rather than using tar format.
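
On the frontend point specifically, the mechanics don't look too bad.
A bare-bones sketch of how a frontend tool might pull in such a module
with dlopen(), with the entry point being the hypothetical
storage_init() from above:

    #include <dlfcn.h>
    #include <stdio.h>

    /* Hypothetical plugin entry point; matches the sketch above. */
    typedef struct StorageNamespace *(*storage_init_fn) (const char *nsname);

    static storage_init_fn
    load_output_plugin(const char *soname)
    {
        void       *handle = dlopen(soname, RTLD_NOW);

        if (handle == NULL)
        {
            fprintf(stderr, "could not load \"%s\": %s\n",
                    soname, dlerror());
            return NULL;
        }
        return (storage_init_fn) dlsym(handle, "storage_init");
    }

The real work would be in agreeing on the API, not in the loading
itself.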

All that being said, I still find it hard to believe that we will want
to add to PostgreSQL itself the library dependencies we'd need for
encryption or S3 cloud storage. So if we go with this more
integrated approach we should consider the possibility that, when the
dust settles, PostgreSQL will only have pg_basebackup
--output-plugin=lz4 and Aurora will also have pg_basebackup
--output-plugin=s3. From my point of view, that would be less than
ideal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
