Re: [HACKERS] A design for amcheck heapam verification

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] A design for amcheck heapam verification
Date: 2018-01-22 22:01:15
Message-ID: CAH2-WzmVKiwcNrhYFH9CTLLcmQTMH_xjW=AvxfDKAftmY47QKw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 11, 2018 at 2:14 AM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> I like heapam verification functionality and use it right now. So, I'm planning to provide review for this patch, probably, this week.

Great!

> Seems like new check is working 4 orders of magnitude faster than bt_index_parent_check() and still finds my specific error that bt_index_check() missed.
> From this output I see that there is corruption, but cannot understand:
> 1. What is the scale of corruption
> 2. Are these corruptions related or not

I don't know the answer to either question, and I don't think that
anyone else could provide much more certainty than that, at least when
it comes to the general case. I think it's important to remember why
that is.

When amcheck raises an error, that really should be a rare,
exceptional event. When I ran amcheck on Heroku's platform, that was
what we found - it tended to be some specific software bug in all
cases (it turns out that Amazon's EBS has been very reliable over the
last few years, at least when it comes to avoiding silent data
corruption). In
general, the nature of those problems was very difficult to predict.

The PostgreSQL project strives to provide a database system that never
loses data, and I think that we generally do very well there. It's
probably also true that (for example) Yandex has some very good DBAs
who take every reasonable step to prevent data loss (validating
hardware, providing substantial redundancy at the storage level, and
so on). We trust the system, and you trust your own operational
procedures, and for the most part everything runs well, because you
(almost) think of everything.

I think that running amcheck at scale is interesting because its very
general approach to validation gives us an opportunity to learn *what
we were wrong about*. Sometimes the reasons will be simple, and
sometimes they'll be complicated, but they'll always be something that we
tried to account for in some way, and just didn't think of, despite
our best efforts. I know that torn pages can happen, which is a kind
of corruption -- that's why crash recovery replays full-page images
(FPIs). If I knew
what problems amcheck might find, then I probably would have already
found a way to prevent them from happening in the first place - there
are limits to what we can predict. (Google "Ludic fallacy" for more
information on this general idea.)

I try to be humble about these things. Very complicated systems can
have very complicated problems that stay around for a long time
without being discovered. Just ask Intel. While it might be true that
some people will use amcheck as the first line of defense, I think
that it makes much more sense as the last line of defense. So, to
repeat myself -- I just don't know.

> I think an interface to list all or top N errors could be useful.

I think that it might be useful if you could specify a limit on how
many errors you'll accept before giving up. I think that it's likely
less useful than you think, though. Once amcheck detects even a single
problem, all bets are off. Or at least any prediction that I might try
to give you now isn't worth much. Theoretically, amcheck should
*never* find any problem, which is actually what happens in the vast
majority of real world cases. When it does find a problem, there
should be some new lesson to be learned. If there isn't some new
insight, then somebody somewhere is doing a bad job.
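
To make the error-limit idea concrete, here is a minimal sketch in
Python. It is not amcheck code and does not reflect any interface in
the patch; run_checks, CheckReport, and max_errors are hypothetical
names chosen only for illustration. It shows nothing more than the
control flow of "collect up to N problems, then give up":

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class CheckReport:
    """Outcome of a bounded verification pass (hypothetical structure)."""
    errors: List[str] = field(default_factory=list)
    limit_reached: bool = False  # True if we stopped because max_errors was hit


def run_checks(
    items: Iterable[T],
    check: Callable[[T], Optional[str]],
    max_errors: int,
) -> CheckReport:
    """Apply check to each item, giving up after max_errors failures.

    check returns None for a healthy item, or an error message otherwise.
    """
    if max_errors < 1:
        raise ValueError("max_errors must be at least 1")

    report = CheckReport()
    for item in items:
        message = check(item)
        if message is None:
            continue
        report.errors.append(message)
        if len(report.errors) >= max_errors:
            # Stop here: items after this one are never examined.
            report.limit_reached = True
            break
    return report
```

With max_errors set to 1 this reduces to stopping at the first problem
found. Note that the later entries in such a report may just be
consequences of the first one, which is why I doubt a longer list
tells you much about the scale of the damage.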

--
Peter Geoghegan
