Re: reporting TID/table with corruption error

From:	Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Peter Geoghegan <pg(at)bowt(dot)ie>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: reporting TID/table with corruption error
Date:	2021-08-20 05:45:44
Message-ID:	35157E10-9F41-405E-9719-04626B8A05F7@yandex-team.ru
Lists:	pgsql-hackers

> On Aug 19, 2021, at 23:19, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
>
>
>> On Aug 19, 2021, at 10:57 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>>
>> High
>> verbosity makes a lot of sense here.
>
> Works for me. We could create another function, "verify_heapam_full" perhaps, that returns additional columns matching those from pageinspect's heap_page_items():

Currently I'm mostly interested in the index functions, to investigate the CIC (CREATE INDEX CONCURRENTLY) bug.
I see 4 different cases for corruption checks:
1. Developer tackling a bug
2. Backup smoke test
3. DBA recovering corrupted data
4. A running system detects an anomaly

In case 1 you want to find the corruption and trace back the events that led to it. You need all the bits that can connect the current state with events in the past.

In case 2 you want a succinct check that, in case of fire, provides the initial information for case 3. Ideally you want a check that combines the "all indexed" check with the heap check. It's also preferable to share a single heap scan between many index checks.
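
For illustration only, this is roughly what such a smoke test looks like today with the existing amcheck functions. The names my_table and my_table_pkey are hypothetical, and verify_heapam() requires PostgreSQL 14 or later. Each call makes its own pass over the heap, which is the duplicated work a shared scan would avoid.

```sql
-- Sketch only: my_table and my_table_pkey are hypothetical names.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Index check with heapallindexed = true: verifies that every heap tuple
-- has a matching index entry (this scans the heap).
SELECT bt_index_check('my_table_pkey'::regclass, true);

-- Separate heap check (PostgreSQL 14+): one row per problem found.
SELECT blkno, offnum, attnum, msg
FROM verify_heapam('my_table'::regclass);
```

With several indexes on the table, the first query would be repeated per index, and each run would scan the heap again.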

In case 3 you want to collect all the corrupted data (find everything with the same xmin/xmax, or on the same page, or with a nearby xmin/xmax). In this case returning the heap page right away would be quite useful.
Sometimes you want to create a backup copy of the page to try some surgery. (create table backup_pages as select * from verify_heapam_full())
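
As a rough approximation of that workflow with functions that already exist, the sketch below combines amcheck's verify_heapam() with pageinspect's get_raw_page() and heap_page_items(). It assumes a hypothetical table my_table, PostgreSQL 14 or later, and superuser privileges. It is not the proposed verify_heapam_full().

```sql
-- Sketch only: my_table is a hypothetical name; needs PostgreSQL 14+.
CREATE EXTENSION IF NOT EXISTS amcheck;
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Save a raw copy of every page on which verify_heapam() reports a problem.
CREATE TABLE backup_pages AS
SELECT bad.blkno,
       get_raw_page('my_table', 'main', bad.blkno) AS page
FROM (SELECT DISTINCT blkno
      FROM verify_heapam('my_table'::regclass)) AS bad;

-- List all line pointers on the saved pages, so that tuples sharing a page
-- or an xmin/xmax with the damaged ones can be found.
SELECT b.blkno, i.lp, i.t_xmin, i.t_xmax, i.t_ctid
FROM backup_pages AS b,
     LATERAL heap_page_items(b.page) AS i;
```

The saved bytea pages can be inspected repeatedly without touching the damaged relation again.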

In case 4 you want to alert the DBA and provide all the information necessary to start case 3. Adding standardised corruption info to all ERRCODE_DATA_CORRUPTED/ERRCODE_INDEX_CORRUPTED errors would suffice. Also, when monitoring wakes you at night you want to know:
- How many tuples are corrupted?
- How long ago was the data corrupted? Is the corruption still within the PITR window?
- Where to find a recovery manual?
But I don't think we can have this logged in the case of "ERROR: t_xmin is uncommitted in tuple to be updated"

Thanks!

Best regards, Andrey Borodin.

In response to

Re: reporting TID/table with corruption error at 2021-08-19 18:19:36 from Mark Dilger
