Deficient error handling in pg_dump and pg_basebackup

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Deficient error handling in pg_dump and pg_basebackup
Date: 2021-11-09 20:20:31
Message-ID: 1343113.1636489231@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Coverity complained about this code in bbstreamer_file.c,
saying that the get_gz_error() call could be accessing freed
memory:

if (gzclose(mystreamer->gzfile) != 0)
{
pg_log_error("could not close compressed file \"%s\": %s",
mystreamer->pathname,
get_gz_error(mystreamer->gzfile));
exit(1);
}

I think it's right. This certainly isn't sanctioned by the zlib
specification [1], nor does it look like our other gzclose() calls,
which simply consult errno. (Note that this coding is not at all new,
but Coverity thinks it is because 23a1c6578 moved it to a new file.)

I set out to fix that, intending only to replace the use of
get_gz_error() with %m, but the patch soon metastasized to the
point where I think it needs review :-(.

0001 below addresses places that either didn't check the result of
gzclose() at all, or neglected to print a useful error message.
The zlib spec doesn't quite promise that a failed gzclose() will
set errno, but the cases where it might not seem like can't-happen
cases that we needn't spend much effort on. So what I've done here
is just to initialize errno to zero before gzclose(). If we ever
get a report of "could not close file: Success" in one of these
places, we'll know to look more closely.

0002 below addresses a rather messier problem, which is that the
error handling in src/bin/pg_basebackup/walmethods.c is just
appallingly bad. The design is awful, because it relies on errno
to hold still for much longer than we can sanely expect --- notably,
walmethods.c really shouldn't be assuming much about what the callers
might do between calling an error-generating method and calling
getlasterror(). Worse, there are a lot of places that naively expect
errno to hold still across free() and similar operations. We know
that not to be the case on (at least) macOS. On top of that,
commit babbbb595 (LZ4 support) broke the design completely, because
liblz4 doesn't use errno to report errors. Separately from the
question of how to report errors, there were a number of places
that failed to detect errors in the first place, and more that were
sloppy about setting the correct error code for a short write.

What I chose to do to fix the design problem is to use a modified
version of the idea that existed in the tar-methods part of the file,
which is to have both a string field and an errno field in the
DirectoryMethodData and TarMethodData objects. If the string isn't
NULL then it overrides the errno field, allowing us to report
non-errno problems. Then, a bunch of places have to remember to copy
errno into the lasterrno field, which is a bit annoying. On the other
hand, we can get rid of logic that tries to save and restore errno
across functions that might change it, so this does buy back some
complexity too.

There's at least one loose end here, which is that there's one
place in tar_close() that does a summary exit(1) on fsync failure,
without even bothering with an error message. I added the
missing error report:

/* Always fsync on close, so the padding gets fsynced */
if (tar_sync(f) < 0)
+ {
+ /* XXX this seems pretty bogus; why is only this case fatal? */
+ pg_log_fatal("could not fsync file \"%s\": %s",
+ tf->pathname, tar_getlasterror());
exit(1);
+ }

but this code seems about as fishy and ill-thought-out as anything
else I've touched here. Why is this different from the half-dozen
other fsync-error cases in the same file? Why, if fsync failure
here is so catastrophic, is it okay to just return a normal failure
code when tar_close doesn't even get to this point because of some
earlier error? At the very least it seems like it'd be preferable
to do the exit(1) at the caller level.

The addition of the exit(1) seems to have been intentional in
1e2fddfa3, so cc'ing Peter for comment.

regards, tom lane

[1] https://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/zlib-gzclose-1.html

Attachment Content-Type Size
0001-improve-close-error-handling-1.patch text/x-diff 3.9 KB
0002-improve-walmethods-error-handling-1.patch text/x-diff 17.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2021-11-09 20:45:34 Re: 2021-11-11 release announcement draft
Previous Message Justin Pryzby 2021-11-09 20:19:37 Re: 2021-11-11 release announcement draft