Re: [PATCH] Verify Checksums during Basebackups

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-03-05 11:36:14
Message-ID: 20180305113614.GW2416@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Michael,

* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> On Sun, Mar 04, 2018 at 06:19:00PM +0100, Magnus Hagander wrote:
> > So sure, if we go with WARNING + exit with an errorcode, that is perhaps
> > the best combination of the two.
>
> I had a look at how to go about this, but it appears to be a bit
> complicated; the first problem is that sendFile() and sendDir() don't
> have status return codes that could be set on checksum verifcation
> failure. So I added a global variable and threw an ereport(ERROR) at the
> end of perform_base_backup(), but then I realized that `pg_basebackup'
> the client program purges the datadir it created if it gets an error:
>
> |pg_basebackup: final receive failed: ERROR: Checksum mismatch during
> |basebackup
> |
> |pg_basebackup: removing data directory "data2"

Oh, ugh.

> So I guess this would have to be sent back via the replication protocol,
> but I don't see an off-hand way to do this easily?

The final ordinary result set could be extended to include the
information about checksum failures..? I'm a bit concerned about what
to do when there are a lot of checksum failures though.. Ideally, you'd
identify all of the pages in all of the files where a checksum failed
(just throwing an error such as the one proposed above is really rather
terrible since you have no idea what block, or even what table, failed
the checksum...).

I realize this is moving the goalposts a long way, but I had actually
always envisioned having file-by-file pg_basebackup being put in at some
point, instead of tablespace-by-tablespace, which would allow for both
an ordinary result set being returned for each file that could contain
information such as the checksum failure and what pages failed, and be a
stepping stone for parallel pg_basebackup..

> Another option would be to see whether it is possible to verify the
> checksum on the client side, but then only pg_basebackup (and no other
> possible external tools using BASE_BACKUP) would profit.

Reviewing the original patch and considering this issue, I believe there
may be a larger problem- while very unlikely, there's been concern that
it's possible to read a half-written page (and possibly only the second
half) and end up with a checksum failure due to that. In pgBackRest, we
address that by doing another read of the page and by checking the LSN
vs. where we started the backup (if the LSN is more recent than when the
backup started then we don't have to care about the page- it'll be in
the WAL).

If we're going to solve that issue the same way pgBackRest does, then
you'd really have to do it server-side, I'm afraid.. Either that, or
add a way for the client to request individual blocks be re-sent, but
that would be awful difficult for pg_basebackup to update in the
resulting tar file if it was compressed..

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2018-03-05 11:47:17 Re: [PATCH] Verify Checksums during Basebackups
Previous Message Daniel Gustafsson 2018-03-05 11:25:36 Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative