Re: Logging corruption error codes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Анна Крханбарова <annkpx(at)yandex-team(dot)ru>, Dmitriy Sarafannikov <dsarafan(at)yandex-team(dot)ru>
Subject: Re: Logging corruption error codes
Date: 2019-07-25 18:27:02
Message-ID: CAH2-Wzk+5G6R3jPYCXKCODWGrMxs0vxKurJcPJdXMp6pOZJ7LQ@mail.gmail.com
Lists: pgsql-bugs

On Thu, Jul 25, 2019 at 3:45 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> From my POV these messages provide meaningful information to cope with corruption. But they are definitely internal.
> Translations already provide some information on toast chunks, mention btree many times, and cover many other internal things.
> So, I'm confused about the status of these messages.
> Such messages should be rare enough and those to whom they are addressed should be familiar with English.

I agree that these don't need to be translated, which means you must
use errmsg_internal() with ereport(). A message like "failed to
re-find parent key in index..." doesn't mean anything to more than a
tiny number of experts. It is useful only because you can paste it
into a search engine. Users will want to search for the English string
anyway.
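
As a rough sketch of what that looks like at a call site, assuming
ERRCODE_INDEX_CORRUPTED (SQLSTATE XX002) is the errcode attached to
this message: the ereport(), errcode() and errmsg_internal()
definitions below are simplified stand-ins so that the fragment
compiles outside the PostgreSQL tree, and report_failed_refind(),
last_error and the index name argument are hypothetical names used
only for illustration. In the backend the real definitions come from
utils/elog.h, errcode() takes an integer code rather than a string,
and an ERROR-level report does not return to the caller.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins only: the backend gets these from utils/elog.h and utils/errcodes.h. */
#define ERROR 0                          /* placeholder value, not the backend's */
#define ERRCODE_INDEX_CORRUPTED "XX002"  /* SQLSTATE for index_corrupted */

/* Hypothetical record of the last report, so the sketch can be inspected. */
typedef struct ReportedError
{
    char sqlstate[6];
    char message[256];
    int  translatable;      /* 0 = message text is never passed to gettext */
} ReportedError;

ReportedError last_error;

/* Stub: remember the SQLSTATE that was attached to the report. */
int
errcode(const char *sqlstate)
{
    strncpy(last_error.sqlstate, sqlstate, 5);
    last_error.sqlstate[5] = '\0';
    return 0;
}

/* Stub: format the message and mark it as not translatable. */
int
errmsg_internal(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(last_error.message, sizeof(last_error.message), fmt, ap);
    va_end(ap);
    last_error.translatable = 0;
    return 0;
}

/* Stub: evaluate the auxiliary calls; the real macro also raises the error. */
#define ereport(elevel, rest) ((void) (elevel), (void) (rest))

/* Hypothetical call site: corruption errcode plus an untranslated message. */
void
report_failed_refind(const char *index_name)
{
    assert(index_name != NULL);
    ereport(ERROR,
            (errcode(ERRCODE_INDEX_CORRUPTED),
             errmsg_internal("failed to re-find parent key in index \"%s\"",
                             index_name)));
}
```

With these stubs, report_failed_refind("some_index") leaves XX002 and
the English text in last_error. Using errmsg() in place of
errmsg_internal() at the same call site would make the string a
candidate for translation, which is what we want to avoid here.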

> This causes various data corruptions, always undetected by data checksums (do we want Merkle tree?).

I don't think that it's possible to verify the integrity of multiple
page images without amcheck support for the access method. It might be
possible to do slightly more in a generic way, but I doubt it.

> Besides messages in this patch we also had:
> could not read block 1751 in file "base/16452/358336": Bad address // Probably not only data corruption, but also a hardware fault
> t_xmin is uncommitted in tuple to be updated // Probably on-disk corruption
> failed to re-find parent key in index // Probably index corruption
> left link changed unexpectedly in block // Probably on-disk data corruption
> right sibling 45056 of block * is not next child * of block * in index // Definitely index corruption
>
> Should I add corruption codes for these messages in the patch? Or make a separate discussion about these?

I don't think that we need to worry too much about the difference
between data corruption and a hardware fault that could theoretically
self-correct. There is a cost to making fine distinctions like this in
the errcodes we use.

--
Peter Geoghegan
