Re: where should I stick that backup?

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: where should I stick that backup?
Date: 2020-04-06 18:31:50
Message-ID: 20200406183150.GE13712@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > * Noah Misch (noah(at)leadboat(dot)com) wrote:
> > > On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > > > What I'm thinking about is: suppose we add an option to pg_basebackup
> > > > with a name like --pipe-output. This would be mutually exclusive with
> > > > -D, but would work at least with -Ft and maybe also with -Fp. The
> > > > argument to --pipe-output would be a shell command to be executed once
> > > > per output file. Any instance of %f in the shell command would be
> > > > replaced with the name of the file that would have been written (and
> > > > %% would turn into a single %). The shell command itself would be
> > > > executed via system(). So if you want to compress, but using some
> > > > other compression program instead of gzip, you could do something
> > > > like:
> > > >
> > > > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> > >
> > > Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> > > for small files.
> >
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
>
> I think the bigger problem with archive_command comes more from how
> it's defined to work, tbh, which leaves a lot of things open.
>
> This sounds to me like a much narrower use-case, which makes it a lot
> more OK. But I agree we have to be careful not to get back into that
> whole mess. One thing would be to clearly document such things *from
> the beginning*, and not try to retrofit it years later like we ended
> up doing with archive_command.

This sounds like a much broader use-case to me, not a narrower one. I
agree that we don't want to try to retrofit things years later.
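
To make the semantics of the quoted proposal concrete, the %f/%%
expansion would presumably amount to something like the sketch below
(purely illustrative; expand_pipe_command and its details are mine, not
from any actual patch):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Expand %f to the output file name and collapse %% to a literal
     * '%', producing a command suitable for system().
     */
    static char *
    expand_pipe_command(const char *tmpl, const char *filename)
    {
        size_t      extra = 0;
        const char *p;
        char       *result;
        char       *dst;

        /* First pass: size the buffer, allowing for repeated %f. */
        for (p = tmpl; *p; p++)
            if (p[0] == '%' && p[1] == 'f')
                extra += strlen(filename);

        result = malloc(strlen(tmpl) + extra + 1);
        if (result == NULL)
            return NULL;

        /* Second pass: copy, expanding %f and collapsing %%. */
        for (p = tmpl, dst = result; *p; p++)
        {
            if (p[0] == '%' && p[1] == 'f')
            {
                strcpy(dst, filename);
                dst += strlen(filename);
                p++;                /* skip the 'f' */
            }
            else if (p[0] == '%' && p[1] == '%')
                *dst++ = *p++;      /* emit a single '%' */
            else
                *dst++ = *p;
        }
        *dst = '\0';
        return result;
    }

So 'bzip2 > %f.bz2' with the file base.tar becomes "bzip2 >
base.tar.bz2", handed straight to system() - which is exactly how every
shell-quoting and error-reporting wart of archive_command comes along
for the ride.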

> And as Robert mentions downthread, the fsync() issue is definitely a
> real one, but if that is documented clearly ahead of time, that's a
> reasonable level of foot-gun, I'd say.

Documented how..?
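
To spell out the gap: once the data has gone through system(), nothing
guarantees the child's output ever reached stable storage. The best a
documentation example could suggest is something like the rough sketch
below (names illustrative), and even that only works when the command
produces a known local file - for a command that ships the data off to
a remote service there is no equivalent at all:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Reopen a file by name and fsync it; returns 0 on success. */
    static int
    fsync_by_name(const char *path)
    {
        int     fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (fsync(fd) < 0)
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int
    main(void)
    {
        /* The child writes the file, but it may sit in the OS cache. */
        if (system("bzip2 < base.tar > base.tar.bz2") != 0)
            return EXIT_FAILURE;

        /*
         * Durability needs an explicit fsync of the result (and,
         * strictly, of the containing directory too, so the name
         * itself survives a crash).
         */
        if (fsync_by_name("base.tar.bz2") != 0)
            return EXIT_FAILURE;
        return EXIT_SUCCESS;
    }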

> > Further, having a generic shell script approach like this would result
> > in things like "well, we don't need to actually add support for X, Y or
> > Z, because we have this wonderful generic shell script thing and you can
> > write your own, and therefore we won't accept patches which do add those
> > capabilities because then we'd have to actually maintain that support."
>
> In principle, I agree with "shellscripts suck".
>
> Now, if we were just talking about compression, it would actually be
> interesting to implement some sort of "postgres compression API" if
> you will, that is implemented by a shared library. This library could
> then be used from pg_basebackup or from anything else that needs
> compression. And anybody who wants could then do a "<compression X>
> for PostgreSQL" module, removing the need for us to carry such code
> upstream.

Getting a bit off-track here, but I actually think we should absolutely
figure out a way to support custom compression options in PG. I had
been thinking of something along the lines of per-datatype actually,
where each data type could define its own compression method, since we
know that different data has different characteristics and therefore
might benefit from different ways of compressing it. Though it's also
true that, generically, there are tradeoffs between CPU time, memory use,
resulting size on disk, etc, and having ways to pick between those could
also be interesting.
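
As a straw-man, such a loadable compression API might be little more
than a struct of function pointers exported by the library. Every name
below is hypothetical, just to make the shape concrete:

    #include <stddef.h>
    #include <sys/types.h>

    typedef struct CompressionProvider
    {
        const char *name;           /* e.g. "zstd" */

        /*
         * Compress srclen bytes of src into dst; returns bytes
         * written, or -1 on failure (such as dst being too small).
         */
        ssize_t     (*compress) (const void *src, size_t srclen,
                                 void *dst, size_t dstlen, int level);

        /* The reverse operation. */
        ssize_t     (*decompress) (const void *src, size_t srclen,
                                   void *dst, size_t dstlen);

        /* Worst-case output size for a given input size. */
        size_t      (*max_compressed_size) (size_t srclen);
    } CompressionProvider;

    /*
     * Each shared library would export one well-known symbol returning
     * its provider, which pg_basebackup (or the backend) looks up
     * after loading the library.
     */
    extern const CompressionProvider *my_compression_provider(void);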

> There's been discussions of that for the backend before IIRC, but I
> don't recall the conclusions. And in particular, I don't recall if it
> included the idea of being able to use it in situations like this as
> well, and with *run-time loading*.

Run-time loading brings in the fun that maybe we aren't able to load the
library when we need to, and what then? :)
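
With dlopen()-style run-time loading, that failure lands at exactly the
wrong moment, as in this rough sketch (library and symbol names are
illustrative):

    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef const void *(*provider_fn) (void);

    int
    main(void)
    {
        void       *handle = dlopen("my-zstd-compression.so", RTLD_NOW);
        provider_fn getter;

        if (handle == NULL)
        {
            /* The "what then?" case: no library when it's needed. */
            fprintf(stderr, "could not load compression library: %s\n",
                    dlerror());
            return EXIT_FAILURE;
        }

        getter = (provider_fn) dlsym(handle, "my_compression_provider");
        if (getter == NULL)
        {
            fprintf(stderr, "missing provider symbol: %s\n", dlerror());
            return EXIT_FAILURE;
        }
        /* ... hand the provider to the compression/backup code ... */
        return EXIT_SUCCESS;
    }

And on the restore side the same failure would mean not being able to
read a backup you already took.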

> And that said, then we'd limit ourselves to compression. We'd still
> need a way to deal with encryption...

And shipping stuff off to some remote server too, at least if we are
going to tell users that they can use this approach to send their
backups to S3... (and that reminds me: there are other things to think
about there too, like maybe you don't want to ship off 0-byte files to
S3, or maybe you don't want to ship tiny files, because there are costs
associated with these things...).

Thanks,

Stephen
