pg_basebackup's --gzip switch misbehaves

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: pg_basebackup's --gzip switch misbehaves
Date: 2022-09-03 15:11:29
Message-ID: 1400032.1662217889@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been trying to figure out why my new buildfarm animal mamba
occasionally fails the pg_basebackup tests [1][2]. I've not run
that to ground yet, but one thing I've found that's consistently
reproducible everywhere is that pg_basebackup's --gzip switch
misbehaves. The manual says, and the principle of least astonishment
agrees, that that should invoke gzip with the default compression
level. However, the three test cases beginning at about line 810 of
010_pg_basebackup.pl produce these output file sizes on my x86_64
Linux machine:

backup_gzip ("--compress 1"):
total 3672
-rw-r-----. 1 postgres postgres 137756 Sep 2 23:38 backup_manifest
-rw-r-----. 1 postgres postgres 3538992 Sep 2 23:38 base.tar.gz
-rw-r-----. 1 postgres postgres 73991 Sep 2 23:38 pg_wal.tar.gz

backup_gzip2 ("--gzip"):
total 19544
-rw-r-----. 1 postgres postgres 137756 Sep 2 23:38 backup_manifest
-rw-r-----. 1 postgres postgres 3086972 Sep 2 23:38 base.tar.gz
-rw-r-----. 1 postgres postgres 16781399 Sep 2 23:38 pg_wal.tar.gz

backup_gzip3 ("--compress gzip:1"):
total 3672
-rw-r-----. 1 postgres postgres 137756 Sep 2 23:38 backup_manifest
-rw-r-----. 1 postgres postgres 3539006 Sep 2 23:38 base.tar.gz
-rw-r-----. 1 postgres postgres 73989 Sep 2 23:38 pg_wal.tar.gz

It makes sense that base.tar.gz is compressed a little better with
--gzip than with level-1 compression, but why is pg_wal.tar.gz not
compressed at all? It looks like the problem probably boils down to
which of "-1" and "0" means "default behavior" vs "no compression",
with different code layers interpreting that differently. I can't
find exactly where that's happening, but I did manage to stop the
failures with this crude hack:

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index e90aa0ba37..edddd9b578 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -1358,7 +1358,7 @@ CreateWalTarMethod(const char *tarbase,
sprintf(tar_data->tarfilename, "%s%s", tarbase, suffix);
tar_data->fd = -1;
tar_data->compression_algorithm = compression_algorithm;
- tar_data->compression_level = compression_level;
+ tar_data->compression_level = compression_level > 0 ? compression_level : Z_DEFAULT_COMPRESSION;
tar_data->sync = sync;
#ifdef HAVE_LIBZ
if (compression_algorithm == PG_COMPRESSION_GZIP)

That's not right as a real fix, because it would have the effect
that "--compress gzip:0" would also invoke default compression,
whereas what it should do is produce the uncompressed output
we're actually getting. Both cases have compression_level == 0
by the time we reach here, though.

I suspect that there are related bugs in other code paths in this
rat's nest of undocumented functions and dubious API abstractions;
but since it's all undocumented, who can say which places are wrong
and which are not?

I might not ding this code quite this hard, if I hadn't had
equally-unpleasant encounters with it previously (eg 248c3a937).
It's a mess, and I do not find it to be up to project standards.

A vaguely-related matter is that the deflateParams calls all pass "0"
as the third parameter:

if (deflateParams(tar_data->zp, tar_data->compression_level, 0) != Z_OK)

Aside from being unreadable, that's entirely unwarranted familiarity
with the innards of libz. zlib.h says you should be writing a named
constant, probably Z_DEFAULT_STRATEGY.

BTW, I'm fairly astonished that anyone would have thought that three
complete pg_basebackup cycles testing essentially-identical options
were a good use of developer time and buildfarm cycles from here to
eternity. Even if digging into it did expose a bug, the test case
deserves little credit for that, because it entirely failed to call
attention to the problem. I had to whack the script pretty hard
just to get it to not delete the evidence.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-09-01%2018%3A38%3A27
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-08-31%2011%3A46%3A09

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-09-03 15:21:29 Re: Can we avoid chdir'ing in resolve_symlinks() ?
Previous Message Andrew Dunstan 2022-09-03 15:06:54 Re: Can we avoid chdir'ing in resolve_symlinks() ?