Re: Logging corruption error codes

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Анна Крханбарова <annkpx(at)yandex-team(dot)ru>, Dmitriy Sarafannikov <dsarafan(at)yandex-team(dot)ru>
Subject: Re: Logging corruption error codes
Date: 2019-07-25 10:45:00
Message-ID: FB0BEAE7-F856-44D6-9130-C8EFD964D1D0@yandex-team.ru
Lists: pgsql-bugs

> On 22 July 2019, at 16:16, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>
> On 2019-06-20 11:57, Andrey Borodin wrote:
>> We are fine-tuning our data corruption monitoring and found that many corruption cases do not report a proper error code.
>> This forces the automatic log analyzer to be a far too smart program.
>> We think that corruption error codes should be reported in cases where B-tree or TOAST code does not know how to interpret the data.
>> PFA a patch with the cases that we have found in logs and consider evidence of corruption.
>>
>> Best regards, Andrey Borodin.
>
> Should we use errmsg_internal() in the adjusted calls, so that the error
> messages are not picked up for translation? I could go either way, but
> it's something that should be considered.

Thanks for looking into this.

From my POV, these messages provide meaningful information for coping with corruption. But they are definitely internal.
Translations already cover some messages about TOAST chunks, mention btree many times, and touch many other internal things.
So I'm confused about the status of these messages.
Such messages should be rare enough, and those to whom they are addressed should be familiar with English.

We've encountered a few more messages that potentially indicate data corruption. In our test environment, we were experimenting with a custom Linux kernel that had a page cache bug. The bug manifested itself as stale page versions reappearing. This caused various kinds of data corruption, always undetected by data checksums (do we want a Merkle tree?).
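
To illustrate why a stale page slips past per-page checksums: an old page version still carries a checksum that matches its own contents, so only a check that spans pages notices the rollback. The following is a minimal sketch, not PostgreSQL code. It uses zlib.crc32 as a stand-in for the real page checksum, a flat hash over all page checksums as the simplest form of the Merkle idea, and hypothetical names throughout.

```python
import hashlib
import zlib


def page_checksum(payload: bytes) -> int:
    # Stand-in for a per-page checksum; PostgreSQL's real algorithm differs.
    return zlib.crc32(payload)


def make_page(payload: bytes) -> tuple[bytes, int]:
    # A "page" here is just its payload plus the checksum stored with it.
    return payload, page_checksum(payload)


def page_is_valid(payload: bytes, stored_checksum: int) -> bool:
    # Per-page check: only compares a page against its own checksum.
    return page_checksum(payload) == stored_checksum


def root_hash(checksums: list[int]) -> str:
    # One-level Merkle-style root: a hash over all page checksums in order.
    h = hashlib.sha256()
    for c in checksums:
        h.update(c.to_bytes(4, "big"))
    return h.hexdigest()


# Current state of a two-page relation, and the root recorded for it.
current = [make_page(b"page 0"), make_page(b"page 1, version 2")]
trusted_root = root_hash([c for _, c in current])

# The buggy page cache hands back an older, self-consistent version of page 1.
stale = [current[0], make_page(b"page 1, version 1")]

stale_passes_page_check = all(page_is_valid(p, c) for p, c in stale)
stale_matches_root = root_hash([c for _, c in stale]) == trusted_root
```

In this sketch the stale page passes its own check, while the root over all pages no longer matches. Where such a root would be stored and how it would be kept up to date are left open here.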

Besides the messages in this patch, we also had:
could not read block 1751 in file "base/16452/358336": Bad address // Most likely not just data corruption, but a hardware fault
t_xmin is uncommitted in tuple to be updated // Probably on-disk corruption
failed to re-find parent key in index // Probably index corruption
left link changed unexpectedly in block // Probably on-disk data corruption
right sibling 45056 of block * is not next child * of block * in index // Definitely index corruption
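
As an illustration of the text matching a log analyzer has to do while these messages carry no corruption error code, here is a minimal sketch. XX001 (data_corrupted) and XX002 (index_corrupted) are the SQLSTATE values from PostgreSQL's error code table. The mapping itself is hypothetical and only follows the comments in the list above; the "could not read block" message is left out on purpose because it looks more like a hardware fault.

```python
import re

# SQLSTATE values from PostgreSQL's error code table.
DATA_CORRUPTED = "XX001"
INDEX_CORRUPTED = "XX002"

# Hypothetical rules: message pattern -> SQLSTATE the message could carry.
RULES = [
    (re.compile(r"t_xmin is uncommitted in tuple to be updated"),
     DATA_CORRUPTED),
    (re.compile(r"left link changed unexpectedly in block"),
     DATA_CORRUPTED),
    (re.compile(r"failed to re-find parent key in index"),
     INDEX_CORRUPTED),
    (re.compile(r"right sibling \S+ of block \S+ is not next child \S+ "
                r"of block \S+ in index"),
     INDEX_CORRUPTED),
]


def classify(message: str):
    """Return the suggested SQLSTATE for a log message, or None."""
    for pattern, code in RULES:
        if pattern.search(message):
            return code
    return None
```

If the server reported these codes itself, the analyzer could drop the patterns and match on the SQLSTATE alone.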

Should I add corruption codes for these messages to the patch? Or start a separate discussion about them?

Thanks!

Best regards, Andrey Borodin.
