Re: design for parallel backup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-04-22 15:24:02
Message-ID: 20200422152402.ck7ziyzgfopgz7bd@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-04-22 09:52:53 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 6:57 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I agree that trying to make backups very fast is a good goal (or well, I
> > think not very slow would be a good descriptor for the current
> > situation). I am just trying to make sure we tackle the right problems
> > for that. My gut feeling is that we have to tackle compression first,
> > because without addressing that "all hope is lost" ;)
>
> OK. I have no objection to the idea of starting with (1) server side
> compression and (2) a better compression algorithm. However, I'm not
> very sold on the idea of relying on parallelism that is specific to
> compression. I think that parallelism across the whole operation -
> multiple connections, multiple processes, etc. - may be a more
> promising approach than trying to parallelize specific stages of the
> process. I am not sure about that; it could be wrong, and I'm open to
> the possibility that it is, in fact, wrong.

*My* gut feeling is that you're going to have a harder time using CPU
time efficiently when doing parallel compression via multiple processes
and independent connections. You're e.g. going to have a lot more
context switches, I think. And there will be network overhead from using
more connections (including worse congestion control).

> Leaving out all the three and four digit wall times from your table:
>
> > method level parallelism wall-time cpu-user-time cpu-kernel-time size rate format
> > pigz 1 10 34.35 364.14 23.55 3892401867 16.6 .gz
> > zstd 1 1 82.95 67.97 11.82 2853193736 22.6 .zstd
> > zstd 1 10 25.05 151.84 13.35 2847414913 22.7 .zstd
> > zstd 6 10 43.47 374.30 12.37 2745211100 23.5 .zstd
> > zstd 6 20 32.50 468.18 13.44 2745211100 23.5 .zstd
> > zstd 9 20 57.99 949.91 14.13 2606535138 24.8 .zstd
> > lz4 1 1 49.94 36.60 13.33 7318668265 8.8 .lz4
> > pixz 1 10 92.54 925.52 37.00 1199499772 53.8 .xz
>
> It's notable that almost all of the fast wall times here are with
> zstd; the surviving entries with pigz and pixz are with ten-way
> parallelism, and both pigz and lz4 have worse compression ratios than
> zstd. My impression, though, is that LZ4 might be getting a bit of a
> raw deal here because of the repetitive nature of the data. I theorize
> based on some reading I did yesterday, and general hand-waving, that
> maybe the compression ratios would be closer together on a more
> realistic data set.

I agree that most datasets won't get even close to what we've seen
here. And that disadvantages e.g. lz4.

To come up with a much less compressible case, I generated data the
following way:

CREATE TABLE random_data(id serial NOT NULL, r1 float not null, r2 float not null, r3 float not null);
ALTER TABLE random_data SET (FILLFACTOR = 100);
ALTER SEQUENCE random_data_id_seq CACHE 1024;
-- with pgbench, I ran this in parallel for 100s
INSERT INTO random_data(r1,r2,r3) SELECT random(), random(), random() FROM generate_series(1, 100000);
-- then created indexes, using a high fillfactor to ensure few zeroed out parts
ALTER TABLE random_data ADD CONSTRAINT random_data_id_pkey PRIMARY KEY(id) WITH (FILLFACTOR = 100);
CREATE INDEX random_data_r1 ON random_data(r1) WITH (fillfactor = 100);

This results in a 16GB base backup. I think this is probably a good bit
less compressible than most PG databases.

method  level  parallelism  wall-time  cpu-user-time  cpu-kernel-time  size        rate  format
gzip    1      1            305.37     299.72         5.52             7067232465  2.28  .gz
lz4     1      1            33.26      27.26          5.99             8961063439  1.80  .lz4
lz4     3      1            188.50     182.91         5.58             8204501460  1.97  .lz4
zstd    1      1            66.41      58.38          6.04             6925634128  2.33  .zstd
zstd    1      10           9.64       67.04          4.82             6980075316  2.31  .zstd
zstd    3      1            122.04     115.79         6.24             6440274143  2.50  .zstd
zstd    3      10           13.65      106.11         5.64             6438439095  2.51  .zstd
zstd    9      10           100.06     955.63         6.79             5963827497  2.71  .zstd
zstd    15     10           259.84     2491.39        8.88             5912617243  2.73  .zstd
pixz    1      10           162.59     1626.61        15.52            5350138420  3.02  .xz
plzip   1      20           135.54     2705.28        9.25             5270033640  3.06  .lz

> It's also notable that lz4 -1 is BY FAR the winner in terms of
> absolute CPU consumption. So I kinda wonder whether supporting both
> LZ4 and ZSTD might be the way to go, especially since once we have the
> LZ4 code we might be able to use it for other things, too.

Yea. I think the case for lz4 is far stronger in other places. E.g.
having lz4 -1 for toast can make a lot of sense: repeated detoasting
suddenly becomes much less of an issue, while still achieving higher
compression than pglz.

.oO(Now I'd really like to see how pglz compares to the above)

> > One thing this reminded me of is whether using a format (tar) that
> > doesn't allow efficient addressing of individual files is a good idea
> > for base backups. The compression rates very likely will be better when
> > not compressing tiny files individually, but at the same time it'd be
> > very useful to be able to access individual files more efficiently than
> > O(N). I can imagine that being important for some cases of incremental
> > backup assembly.
>
> Yeah, being able to operate directly on the compressed version of the
> file would be very useful, but I'm not sure that we have great options
> available there. I think the only widely-used format that supports
> that is ".zip", and I'm not too sure about emitting zip files.

I don't really see a problem with emitting .zip files. It's an extremely
widely used container format for all sorts of file formats these days.
Except for needing somewhat more complicated code during generation /
unpacking (and I don't think it's *that* big of a difference), it seems
clearly advantageous over .tar.gz etc.
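
For anyone not familiar with the format: the reason random access is
cheap with .zip is the central directory at the end of the archive,
which maps each individually compressed member to the offset of its
data. Roughly sketched (most fields omitted; on disk these records are
little-endian and unpadded, so real code would parse raw bytes rather
than overlay structs):

/*
 * Abbreviated sketch of a .zip central directory entry; many fields are
 * omitted, this is only meant to show why member lookup is cheap.
 */
#include <stdint.h>

typedef struct ZipCentralDirEntry
{
	uint32_t	signature;			/* 0x02014b50 */
	uint16_t	compression_method;	/* each member is compressed on its own */
	uint32_t	crc32;
	uint32_t	compressed_size;
	uint32_t	uncompressed_size;
	uint16_t	filename_len;
	uint32_t	local_header_offset;	/* seek straight to this member's data */
	/* ... file name and several more fields follow ... */
} ZipCentralDirEntry;

/*
 * The "end of central directory" record (signature 0x06054b50) at the very
 * end of the archive stores the offset and entry count of the central
 * directory, so a reader can find any member in O(members), not O(bytes).
 */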

> Apparently, pixz also supports random access to archive members, and
> it did have one entry that survived my arbitrary cut in the table
> above, but the last release was in 2015, and it seems to be only a
> command-line tool, not a library. It also depends on libarchive and
> liblzma, which is not awful, but I'm not sure we want to suck in that
> many dependencies. But that's really a secondary thing: I can't
> imagine us depending on something that hasn't had a release in 5
> years, and has less than 300 total commits.

Oh, yea. I just looked at the various tools I could find that did
parallel compression.

> Other options include, perhaps, (1) emitting a tarfile of compressed
> files instead of a compressed tarfile

Yea, that'd help some. Although I am not sure how good the tooling is
for seeking through tarfiles in an O(files) rather than O(bytes) manner.
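
Mechanically it's not hard, since each tar member is just a 512-byte
header followed by size-padded data; a minimal sketch of building such
an index by walking headers (plain ustar only, no pax/GNU long-name
extensions, no error handling) could look like:

/*
 * Minimal sketch: walk a tar stream and print each member's name, data
 * offset and size without reading the data itself.
 */
#include <stdio.h>
#include <stdlib.h>

#define TAR_BLOCK 512

static void
index_tar(FILE *archive)
{
	char		header[TAR_BLOCK];
	long		offset = 0;

	while (fread(header, 1, TAR_BLOCK, archive) == TAR_BLOCK)
	{
		unsigned long datalen;

		if (header[0] == '\0')
			break;				/* zero block marks end of archive */

		/* member size: 12 bytes of octal ASCII at offset 124 */
		datalen = strtoul(header + 124, NULL, 8);

		printf("%s: offset %ld, %lu bytes\n",
			   header, offset + TAR_BLOCK, datalen);

		/* skip the data, which is padded out to a 512-byte boundary */
		offset += TAR_BLOCK + ((datalen + TAR_BLOCK - 1) / TAR_BLOCK) * TAR_BLOCK;
		fseek(archive, offset, SEEK_SET);
	}
}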

I think there are some cases where using separate compression state for
each file would hurt us. Some of the archive formats have support for
reusing compression state, but I don't know which.
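
With libzstd, one way to soften the per-file penalty would be to reuse a
single compression context across files, optionally together with a
shared pre-trained dictionary. Just as a sketch of the library API
involved, not something I'm proposing concretely (dictionary training
and error handling omitted):

/*
 * Sketch: compress each file into its own zstd frame while reusing one
 * ZSTD_CCtx, and optionally a shared dictionary to recover some of the
 * ratio lost by compressing small files independently.
 */
#include <stddef.h>
#include <zstd.h>

typedef struct PerFileCompressor
{
	ZSTD_CCtx  *cctx;			/* reused across files, avoids re-allocation */
	ZSTD_CDict *cdict;			/* optional shared dictionary, may be NULL */
} PerFileCompressor;

/* Compress one file's contents into its own frame; returns compressed size. */
static size_t
compress_one_file(PerFileCompressor *pc,
				  void *dst, size_t dstCapacity,
				  const void *src, size_t srcSize)
{
	if (pc->cdict != NULL)
		return ZSTD_compress_usingCDict(pc->cctx, dst, dstCapacity,
										src, srcSize, pc->cdict);

	/* level 3 here is arbitrary, just for the sketch */
	return ZSTD_compressCCtx(pc->cctx, dst, dstCapacity, src, srcSize, 3);
}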

> , and (2) writing our own index files. We don't know when we begin
> emitting the tarfile what files we're going to find or how big they
> will be, so we can't really emit a directory at the beginning of the
> file. Even if we thought we knew, files can disappear or be truncated
> before we get around to archiving them. However, when we reach the end
> of the file, we do know what we included and how big it was, so
> possibly we could generate an index for each tar file, or include
> something in the backup manifest.

Hm. There's some appeal to just storing offsets in the manifest, and
making sure each is a seekable offset in the compression stream. OTOH,
that makes it pretty hard for other tools to generate a compatible
archive.
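
Just to illustrate what I mean, such an index entry wouldn't need much
more than the member name, a seekable offset into the compressed stream,
and the uncompressed size. Hypothetically (this format doesn't exist,
the names are made up):

#include <stdint.h>

typedef struct BackupIndexEntry
{
	const char *relative_path;		/* e.g. "base/<dboid>/<relfilenode>" */
	uint64_t	stream_offset;		/* offset of a restartable point, e.g. a
									 * zstd frame boundary, in the archive */
	uint64_t	uncompressed_size;	/* size of the member after decompression */
} BackupIndexEntry;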

> > The other big benefit is that zstd's library has multi-threaded
> > compression built in, whereas that's not the case for other libraries
> > that I am aware of.
>
> Wouldn't it be a problem to let the backend become multi-threaded, at
> least on Windows?

We already have threads on Windows, e.g. the signal handler emulation
stuff runs in one. Are you thinking of this bit in postmaster.c:

#ifdef HAVE_PTHREAD_IS_THREADED_NP

	/*
	 * On macOS, libintl replaces setlocale() with a version that calls
	 * CFLocaleCopyCurrent() when its second argument is "" and every relevant
	 * environment variable is unset or empty. CFLocaleCopyCurrent() makes
	 * the process multithreaded. The postmaster calls sigprocmask() and
	 * calls fork() without an immediate exec(), both of which have undefined
	 * behavior in a multithreaded program. A multithreaded postmaster is the
	 * normal case on Windows, which offers neither fork() nor sigprocmask().
	 */
	if (pthread_is_threaded_np() != 0)
		ereport(FATAL,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("postmaster became multithreaded during startup"),
				 errhint("Set the LC_ALL environment variable to a valid locale.")));
#endif

?

I don't really see any of the concerns there applying to the base
backup case.
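
To be concrete about the "built in" part: libzstd creates and manages
the worker threads internally; the backend would just set
ZSTD_c_nbWorkers on one context and push data through it, without ever
spawning a thread itself. A minimal sketch of driving that (error
handling trimmed, buffer handling schematic):

#include <zstd.h>

static ZSTD_CCtx *
create_parallel_cctx(int level, int nworkers)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();

	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
	/* nbWorkers > 0 enables the library-internal worker threads */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);
	return cctx;
}

/* Push one chunk of backup data through the context; "end" finishes the frame. */
static void
compress_chunk(ZSTD_CCtx *cctx, const void *in, size_t in_len,
			   void *out, size_t out_cap, int end)
{
	ZSTD_inBuffer input = {in, in_len, 0};
	size_t		remaining;

	do
	{
		ZSTD_outBuffer output = {out, out_cap, 0};

		remaining = ZSTD_compressStream2(cctx, &output, &input,
										 end ? ZSTD_e_end : ZSTD_e_continue);
		if (ZSTD_isError(remaining))
			return;				/* real code would ereport() here */

		/* output.pos bytes at "out" are now ready to be sent to the client */
	} while (end ? remaining != 0 : input.pos < input.size);
}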

Greetings,

Andres Freund
