Re: [PATCH] Verify Checksums during Basebackups

From: David Steele <david(at)pgmasters(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [PATCH] Verify Checksums during Basebackups
Date: 2018-03-23 14:54:08
Message-ID: 1a230b05-6cfc-2557-20cc-09e7a176230c@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi Michael,

On 3/23/18 5:36 AM, Michael Banck wrote:
> Am Donnerstag, den 22.03.2018, 12:22 -0400 schrieb David Steele:
>>
>> + if (phdr->pd_checksum != checksum)
>>
>> I've attached a patch that adds basic retry functionality. It's not
>> terrible efficient since it rereads the entire buffer for any block
>> error. A better way is to keep a bitmap for each block in the buffer,
>> then on retry compare bitmaps. If the block is still bad, report it.
>> If the block was corrected moved on. If a block was good before but is
>> bad on retry it can be ignored.
>
> I have to admit I find it a bit convoluted and non-obvious on first
> reading, but I'll try to check it out some more.

Yeah, I think I was influenced too much by how pgBackRest does things,
which doesn't work as well here. Attached is a simpler version.

> I think it would be very useful if we could come up with a testcase
> showing this problem, but I guess this will be quite hard to hit
> reproducibly, right?

This was brought up by Robert in [1] when discussing validating
checksums during backup. I don't know of any way to reproduce this
issue but it seems perfectly possible, if highly unlikely.

>> + ereport(WARNING,
>> + (errmsg("checksum verification failed in file "
>>
>> I'm worried about how verbose this warning could be since there are
>> 131,072 blocks per segment. It's unlikely to have that many block
>> errors, but users do sometimes put files in PGDATA which look like they
>> should be validated. Since these warnings all go to the server log it
>> could get pretty bad.
>
> We only verify checksums of files in tablespaces, and I don't think
> dropping random files in those is supported in any way, as opposed to
> maybe the top-level PGDATA directory. So I would say that this is not a
> real concern.

Perhaps, but a very corrupt file is still going to spew lots of warnings
into the server log.

>> We should either stop warning after the first failure, or aggregate the
>> failures for a file into a single message.
>
> I agree that major corruption could make the whole output blow up but I
> would prefer to keep this feature simple for now, which implies possibly
> printing out a lot of WARNING or maybe just stopping after the first
> one (or first few, dunno).

In my experience actual block errors are relatively rare, so there
aren't likely to be more than a few in a file. More common are
overwritten or transposed files, rogue files, etc. These produce a lot
of output.

Maybe stop after five?

Regards,
--
-David
david(at)pgmasters(dot)net

[1]
https://www.postgresql.org/message-id/CA%2BTgmobHd%2B-yVJHofSWg%3Dg%2B%3DA3EiCN2wsAiEyj7dj1hhevNq9Q%40mail.gmail.com

Attachment Content-Type Size
reread-v2.patch text/plain 2.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-03-23 14:59:50 Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)
Previous Message Haribabu Kommi 2018-03-23 14:49:28 Re: PQHost() undefined behavior if connecting string contains both host and hostaddr types