Re: Online checksums verification in the backend

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums verification in the backend
Date: 2019-12-25 02:23:50
Message-ID: CA+fd4k5BBPfsPVOZ0JrH5G8h1aWL=UmUWimRL+jKovp5BTfspg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > >
> > > This brings the second consideration: how to report the list of
> > > corrupted blocks to end users. As I said, this is for now returned via
> > > the SRF, but that is clearly not ideal and should rather be made
> > > available more globally. One usage of this information could be
> > > block-level recovery. I'm Cc-ing Sawada-san, as I know he's working on
> > > this and mentioned to me that he had ideas on passing the list of
> > > corrupted blocks using the stats collector.
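
(For reference, the usual way such an SRF materializes its result set looks
roughly like the sketch below. The function name, the two-column result and
the placeholder values are purely illustrative, not what the patch actually
defines.)

#include "postgres.h"

#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "utils/tuplestore.h"

PG_FUNCTION_INFO_V1(pg_check_relation_blocks);

/* Hypothetical SRF: one row per corrupted block found during the check. */
Datum
pg_check_relation_blocks(PG_FUNCTION_ARGS)
{
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc   tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext oldcontext;

    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("set-valued function called in context that cannot accept a set")));
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    /* materialize the result set in the per-query memory context */
    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;
    MemoryContextSwitchTo(oldcontext);

    /*
     * The actual check would run here; for every corrupted block it finds,
     * emit a (forknum, blkno) row.  Placeholder values below.
     */
    {
        Datum       values[2] = {Int32GetDatum(0), Int64GetDatum(0)};
        bool        nulls[2] = {false, false};

        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }

    return (Datum) 0;
}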
> >
> > Yes, the list of corrupted pages is necessary for single-page
> > recovery. Apart from single-page recovery, I think it's also helpful
> > for DBAs if they can find the corrupted blocks in the server logs and
> > in a system view.
> >
> > I also tried reporting corrupted pages to the stats collector while
> > researching single-page recovery in PostgreSQL, but one problem is
> > that the statistics in the stats collector are cleared during crash
> > recovery. I want the information about block corruption to survive
> > even when the server goes down.
>
> Yes, having the list of corrupted blocks survive a crash-and-restart
> cycle, and also be available after a clean shutdown, is definitely
> important.
>
> > And we might want to add checksums to the permanent file that holds
> > the information about database corruption. The correctness of that
> > information is important, because we can fix a database by restoring
> > some tables from a logical backup or by reindexing etc., as long as
> > the record of the corruption is itself not broken.
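
(Just to illustrate the point about checksumming that permanent file: a
per-record CRC, computed with the same CRC-32C macros WAL already uses,
would probably be enough. The record layout below is purely hypothetical,
a sketch only.)

#include "postgres.h"

#include "port/pg_crc32c.h"

/* Hypothetical on-disk record describing one corrupted block. */
typedef struct CorruptedBlockRecord
{
    Oid         relid;
    int32       forknum;
    uint32      blkno;
    pg_crc32c   crc;            /* CRC of all fields above */
} CorruptedBlockRecord;

static void
corrupted_block_record_set_crc(CorruptedBlockRecord *rec)
{
    INIT_CRC32C(rec->crc);
    COMP_CRC32C(rec->crc, rec, offsetof(CorruptedBlockRecord, crc));
    FIN_CRC32C(rec->crc);
}

static bool
corrupted_block_record_crc_ok(const CorruptedBlockRecord *rec)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    COMP_CRC32C(crc, rec, offsetof(CorruptedBlockRecord, crc));
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, rec->crc);
}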
>
> Agreed
>
> > > Finally, the read and locking considerations. I tried to cover that
> > > extensively in the comments, but here are some details on how I tried
> > > to make the check safe while keeping the overhead as low as possible.
> > > The first thing is that this only does buffered reads, without any
> > > attempt to discard the OS cache. Therefore, any discrepancy between
> > > the OS cache and the disk cannot be detected unless you take other
> > > actions, such as sync / drop_caches on GNU/Linux.
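
(So the on-disk side of the check presumably boils down to something like
the sketch below: read the block through the normal, buffered smgr path and
recheck its checksum with PageIsVerified(). The helper function itself is
illustrative only.)

#include "postgres.h"

#include "storage/bufpage.h"
#include "storage/smgr.h"

/*
 * Read one block through the regular, buffered smgr path and recheck its
 * checksum.  No attempt is made to bypass or invalidate the OS cache, so a
 * discrepancy between the OS cache and the disk won't be noticed here.
 */
static bool
disk_block_checksum_ok(SMgrRelation smgr, ForkNumber forknum,
                       BlockNumber blkno)
{
    PGAlignedBlock buf;         /* properly aligned BLCKSZ buffer */

    smgrread(smgr, forknum, blkno, buf.data);

    /* PageIsVerified() checks the checksum (and the all-zeroes case) */
    return PageIsVerified((Page) buf.data, blkno);
}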
> > >
> > > An access share lock on the currently checked relation is held,
> > > meaning that it can't get deleted/truncated. The total number of
> > > blocks for the given fork is retrieved first, so any block added later
> > > will be ignored. Such new blocks are considered out of scope, as they
> > > were written after the start of the check.
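
(In other words, the scan setup is roughly the sketch below: take an access
share lock, snapshot the fork size once, and only look at blocks below that
limit. Apart from the core backend APIs, names are illustrative.)

#include "postgres.h"

#include "access/relation.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Scan setup: hold an access share lock so the relation can't be dropped
 * or truncated, and snapshot the fork size once so that blocks appended
 * after the check started are simply out of scope.
 */
static void
check_relation_fork(Oid relid, ForkNumber forknum)
{
    Relation    rel;
    BlockNumber nblocks;
    BlockNumber blkno;

    rel = relation_open(relid, AccessShareLock);

    nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* per-block check goes here, see the later sketches */
    }

    relation_close(rel, AccessShareLock);
}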
> > >
> > > Each time a buffer is being checked, the target buffer mapping
> > > partition lock is acquired in shared mode, to prevent concurrent
> > > eviction. If the buffer is found in shared buffers, it's pinned and
> > > released immediately, just to get the state.
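
(A rough sketch of that per-block lookup, using the buffer-manager internals
that an extension like pg_buffercache also uses. The approach described
above pins the buffer to read its state; the sketch below just peeks at the
buffer header under the header spinlock, which is the simplest way to show
the idea.)

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"

/*
 * Look the block up in shared buffers while holding the target buffer
 * mapping partition lock in shared mode, so the buffer can't be evicted
 * concurrently, and report whether it is present and dirty.
 */
static bool
block_is_dirty_in_shared_buffers(SMgrRelation smgr, ForkNumber forknum,
                                 BlockNumber blkno)
{
    BufferTag   tag;
    uint32      hash;
    LWLock     *partLock;
    int         buf_id;
    bool        dirty = false;

    INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forknum, blkno);
    hash = BufTableHashCode(&tag);
    partLock = BufMappingPartitionLock(hash);

    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hash);
    if (buf_id >= 0)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
        uint32      buf_state = LockBufHdr(bufHdr);

        dirty = (buf_state & BM_DIRTY) != 0;
        UnlockBufHdr(bufHdr, buf_state);
    }
    LWLockRelease(partLock);

    return dirty;
}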
> >
> > I wonder if it's possible for blocks on disk to be corrupted even
> > when they are loaded in shared buffers. ISTM the above method cannot
> > detect such corruption. Reading and checking blocks fast is
> > attractive, but I think it's also important to check blocks
> > thoroughly, without overlooking anything.
>
> It can definitely happen, and it's the usual doomsday scenario: the
> database works fine for months, then postgres is restarted, say for a
> minor version upgrade, and then boom: the most popular blocks, which
> were constantly used read-only, were corrupted on disk but never
> evicted from shared buffers, and you have a major outage. I have
> unfortunately witnessed that too many times. This is especially bad
> because in this kind of scenario you typically discover the corruption
> once all your backups contain only the corrupted blocks.
>
> Note that in the approach I'm suggesting, I do verify blocks that are
> loaded in shared buffers; I only ignore dirty blocks, as they'll be
> written out by the checkpointer, or by the recovery process in case of
> an unclean shutdown. A buffer pin isn't necessary to avoid a torn page
> read: an I/O lock also guarantees that, with less overhead. The
> included TAP test should also detect corruption of a
> present-in-shared-buffers, non-dirty block. It could however be
> improved, e.g. by calling pg_prewarm to make sure the block is indeed
> in shared_buffers, and by running the same test after a clean restart
> to make sure it hits the not-in-shared-buffers case.
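
So, if I understand correctly, the per-block decision boils down to roughly
the following (building on the two helper sketches above;
report_corrupted_block() is just a hypothetical placeholder for whatever
reporting mechanism ends up being used):

static void
check_one_block(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
{
    /*
     * Dirty buffers will be written out by the checkpointer, or replayed
     * by recovery after a crash, so their current on-disk image is not
     * interesting.
     */
    if (block_is_dirty_in_shared_buffers(smgr, forknum, blkno))
        return;

    /* Everything else is re-read from disk and its checksum verified. */
    if (!disk_block_checksum_ok(smgr, forknum, blkno))
        report_corrupted_block(smgr, forknum, blkno);   /* hypothetical */
}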

It reads non-dirty blocks from disk even if they are loaded in shared
buffers. Now I understand. Thanks!

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
