Re: a very minor bug and a couple of comment changes for basebackup.c

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a very minor bug and a couple of comment changes for basebackup.c
Date: 2023-03-02 07:59:35
Message-ID: ZABXZ/mfr40VLK7w@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 02, 2023 at 03:23:51PM -0500, Robert Haas wrote:
> 0001 fixes what I believe to be a slight logical error in sendFile(),
> introduced by me during the v15 development cycle when I introduced
> the bbsink abstraction. I believe that it is theoretically possible
> for this to cause an assertion failure, although the chances of that
> actually happening seem extremely remote in practice. I don't believe
> there are any consequences worse than that; for instance, I don't
> think this can result in your backup getting corrupted. See the
> proposed commit message for full details. Because I believe that this
> is formally a bug, I am inclined to think that this should be
> back-patched, but I also think it's fairly likely that no one would
> ever notice if we didn't. However, patch authors have been known to be
> wrong about the consequences of their own bugs from time to time, so
> please do let me know if this seems more serious to you than what I'm
> indicating, or conversely if you think it's not a problem at all for
> some reason.

Seems right, I think that you should backpatch that as
VERIFY_CHECKSUMS is the default.

> 0002 removes an old comment from the file that I find useless and
> slightly misleading.

Okay.

> 0003 rewrites a comment about the way that we verify checksums during
> backups. If we get a checksum mismatch, we reread the block and see if
> the perceived problem goes away. If it doesn't, then we report it.
> This is intended as protection against the backup reading a block
> while some other process is in the midst of writing it, but there's no
> guarantee that any concurrent write will finish quickly enough for our
> second read attempt to see the updated contents.

There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that it is higher than
sink->bbs_state->startptr, the checksum verification is just skipped
while the page is broken but not reported as such. (Spoiler: this has
been mentioned in the past, and maybe we'd better remove this stuff in
its entirety.)

> The comment claims
> otherwise, and that's false, and I'm getting tired of reading this
> false claim every time I read this code, so I rewrote the comment to
> say what I believe to be true, namely, that our algorithm is flaky and
> we have no good way to fix that right now. I'm pretty sure that Andres
> pointed this problem out when this feature was under discussion, but
> somehow it's still like this. There's another nearby comment which is
> also false or at least misleading for basically the same reasons which
> probably should be rewritten too, but I was a bit less certain how to
> rewrite it and it wasn't making me as annoyed as this one, so for now
> I only rewrote the one.

Indeed, that's an improvement ;)
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shiy.fnst@fujitsu.com 2023-03-02 08:06:53 RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Previous Message Drouvot, Bertrand 2023-03-02 07:39:14 Re: Generate pg_stat_get_xact*() functions with Macros