Re: Deficient error handling in pg_dump and pg_basebackup

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: Deficient error handling in pg_dump and pg_basebackup
Date: 2021-11-10 05:03:11
Message-ID: YYtSj5vlWp5faVXz@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 09, 2021 at 03:20:31PM -0500, Tom Lane wrote:
> I think it's right. This certainly isn't sanctioned by the zlib
> specification [1], nor does it look like our other gzclose() calls,
> which simply consult errno. (Note that this coding is not at all new,
> but Coverity thinks it is because 23a1c6578 moved it to a new file.)
>
> I set out to fix that, intending only to replace the use of
> get_gz_error() with %m, but the patch soon metastasized to the
> point where I think it needs review :-(.

This points gzclose_w() to gzwrite.c. Looking at its end, it could
return Z_ERRNO after freeing everything when failing to close the
state's fd. So I agree that coverity is right about this
possibility. That's unlikely going to happen, though.

> 0001 below addresses places that either didn't check the result of
> gzclose() at all, or neglected to print a useful error message.
> The zlib spec doesn't quite promise that a failed gzclose() will
> set errno, but the cases where it might not seem like can't-happen
> cases that we needn't spend much effort on.

The zlib code does not do much regarding that, but system calls
would. Still that joins with the point you are doing below where
free() may not hold errno, and note that the last thing gzclose() does
is a free(). :p

> So what I've done here
> is just to initialize errno to zero before gzclose(). If we ever
> get a report of "could not close file: Success" in one of these
> places, we'll know to look more closely.

Okay.

> On top of that,
> commit babbbb595 (LZ4 support) broke the design completely, because
> liblz4 doesn't use errno to report errors. Separately from the
> question of how to report errors, there were a number of places
> that failed to detect errors in the first place, and more that were
> sloppy about setting the correct error code for a short write.

This one's my fault, sorry about that :/

What you are doing here looks fine for the LZ4F_isError() paths.

> What I chose to do to fix the design problem is to use a modified
> version of the idea that existed in the tar-methods part of the file,
> which is to have both a string field and an errno field in the
> DirectoryMethodData and TarMethodData objects. If the string isn't
> NULL then it overrides the errno field, allowing us to report
> non-errno problems. Then, a bunch of places have to remember to copy
> errno into the lasterrno field, which is a bit annoying. On the other
> hand, we can get rid of logic that tries to save and restore errno
> across functions that might change it, so this does buy back some
> complexity too.

This makes the code a bit harder to follow when it comes to cascading
calls with tar_write_*() or tar_close(), but your approach looks
simple enough for this code. I don't think you have missed a spot
after scanning the whole area.

> There's at least one loose end here, which is that there's one
> place in tar_close() that does a summary exit(1) on fsync failure,
> without even bothering with an error message. I added the
> missing error report:
>
> /* Always fsync on close, so the padding gets fsynced */
> if (tar_sync(f) < 0)
> + {
> + /* XXX this seems pretty bogus; why is only this case fatal? */
> + pg_log_fatal("could not fsync file \"%s\": %s",
> + tf->pathname, tar_getlasterror());
> exit(1);
> + }

> but this code seems about as fishy and ill-thought-out as anything
> else I've touched here. Why is this different from the half-dozen
> other fsync-error cases in the same file? Why, if fsync failure
> here is so catastrophic, is it okay to just return a normal failure
> code when tar_close doesn't even get to this point because of some
> earlier error?

Hmm, I don't think that's fine. There is an extra one in tar_finish()
that would report a failure, but not exit() immediately. All the
callers of the sync callback in receivelog.c exit() on sight, but I am
wondering if it would not be saner to just exit() from walmethods.c
each time we see a failure with a fsync().

> At the very least it seems like it'd be preferable
> to do the exit(1) at the caller level.

You mean walmethods.c here, right?
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-11-10 05:46:38 Re: Identify missing publications from publisher while create/alter subscription.
Previous Message Greg Nancarrow 2021-11-10 04:31:19 Re: Optionally automatically disable logical replication subscriptions on error