Re: Logging corruption error codes

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Анна Крханбарова <annkpx(at)yandex-team(dot)ru>, Dmitriy Sarafannikov <dsarafan(at)yandex-team(dot)ru>
Subject: Re: Logging corruption error codes
Date: 2019-06-21 10:22:15
Message-ID: 28DE958A-DB3B-4266-B960-596B0092FF8E@yandex-team.ru
Lists: pgsql-bugs

> On 20 June 2019, at 22:09, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>
> On 2019-Jun-20, Andrey Borodin wrote:
>
>> Hi!
>>
>> We are fine-tuning our data corruption monitoring and found that many corruption cases do not report a proper error code.
>> This forces an automatic log analyzer to be a much too smart program.
>> We think a corruption error code should be reported whenever B-tree or TOAST code does not know how to interpret the data.
>> Please find attached a patch covering the cases that we have found in our logs and consider evidence of corruption.
>
> This is not totally insane -- other similar messages such as 'corrupted
> page pointers' in bufpage.c get the same errcode.
On master, the only message in bufpage.c without that errcode is
elog(ERROR, "incorrect index offsets supplied");
but this indicates misuse by the caller, not corrupted data on disk.
The other messages there already use ERRCODE_DATA_CORRUPTED.
>
> I would like to have a separate marking for messages indicating a
> system-level permanent problem rather than user error ("table/column X
> does not exist"), retryable condition ("serializability violation"), or
> resource exhaustion ("out of memory", "too many clients"),
Good idea, but aren't there standards we have to comply with?

> but that's
> probably a separate patch: things like "could not open/read/write file"
> for a data file, or "xlog flush request XYZ not satisfied", and so on,
> which also indicate a kind of corruption.
I believe we should not report hardware problems as corruption. But this worries us (YC) too. Do you think this problem deserves a separate patch?
Introducing a new error code is, in a sense, a new feature. Should I send it to pgsql-hackers?

> As you say, currently we have
> to have much too smart programs to weed out the serious errors that
> ought to show up in an alerting system from run-of-the-mill problems.
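
To illustrate why the error codes matter for alerting, here is a minimal sketch in C of the kind of check a log analyzer could make once corruption is reported with a proper SQLSTATE. It assumes the SQLSTATE is available in each log line (for example via %e in log_line_prefix). The enum and the function name are hypothetical and exist only for this example; the category split follows the one suggested above and is not an existing PostgreSQL classification.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical alert categories for this sketch; not part of PostgreSQL. */
typedef enum
{
    SEV_USER_ERROR,   /* e.g. "table/column X does not exist" */
    SEV_RETRYABLE,    /* e.g. serialization failure */
    SEV_RESOURCE,     /* e.g. out of memory, too many connections */
    SEV_CORRUPTION,   /* data or index corruption */
    SEV_OTHER         /* everything else, including plain elog(ERROR) */
} severity_t;

/* Classify a five-character SQLSTATE taken from a server log line. */
severity_t
classify_sqlstate(const char *sqlstate)
{
    if (sqlstate == NULL || strlen(sqlstate) != 5)
        return SEV_OTHER;

    /* XX001 = data_corrupted, XX002 = index_corrupted */
    if (strcmp(sqlstate, "XX001") == 0 || strcmp(sqlstate, "XX002") == 0)
        return SEV_CORRUPTION;

    /* 40001 = serialization_failure, 40P01 = deadlock_detected */
    if (strcmp(sqlstate, "40001") == 0 || strcmp(sqlstate, "40P01") == 0)
        return SEV_RETRYABLE;

    /* Class 53 = insufficient resources */
    if (strncmp(sqlstate, "53", 2) == 0)
        return SEV_RESOURCE;

    /* Class 42 = syntax error or access rule violation */
    if (strncmp(sqlstate, "42", 2) == 0)
        return SEV_USER_ERROR;

    return SEV_OTHER;
}

/* Human-readable label for a category, e.g. for an alert message. */
const char *
severity_name(severity_t sev)
{
    switch (sev)
    {
        case SEV_USER_ERROR: return "user error";
        case SEV_RETRYABLE:  return "retryable";
        case SEV_RESOURCE:   return "resource exhaustion";
        case SEV_CORRUPTION: return "corruption";
        case SEV_OTHER:      return "other";
    }
    assert(0 && "unhandled severity_t value");
    return "other";
}
```

A message raised with a plain elog(ERROR, ...) carries XX000 (internal_error), so it falls into SEV_OTHER here. That is the case where the analyzer has to fall back to matching message text.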

Thanks!

Best regards, Andrey Borodin.
