Re: Online verification of checksums

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2019-03-08 17:59:41
Message-ID: CAOBaU_ZaNXN1TMToZG4q4gQZeS7KVUx2_Hsa4MpAwhbu1N-WXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely. I already saw servers with all-frozen-read-only blocks
> > popular enough to never get evicted in months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?

Ah, yes sorry I should have been clearer. Indeed, there was silent
data corruptions (no ckecksum though) that was revealed by the
restart. So a routine minor update resulted in a massive outage.
Such a scenario can't be avoided if we always bypass checksum check
for alreay in shared_buffers pages.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-03-08 18:08:29 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Previous Message Tomas Vondra 2019-03-08 17:49:56 Re: Online verification of checksums