Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Date: 2021-05-01 12:35:09
Message-ID: 9AE6D0CE-D423-41A4-831A-2571EFA0FC07@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres!

> 23 апр. 2021 г., в 01:36, Andres Freund <andres(at)anarazel(dot)de> написал(а):
>
> So snapshots within that window will always be "empty", i.e. xmin is
> latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
> look at the procarray, which then leads xmin going back to 588.
>
>
> I think that this can lead to data corruption, because a too new xmin horizon
> could lead to rows from a prepared transaction getting hint bitted as dead (or
> perhaps even pruned, although that's probably harder to hit). Due to the too
> new xmin horizon we won't treat rows by the prepared xact as in-progress, and
> TransactionIdDidCommit() will return false, as the transaction didn't commit
> yet. Which afaict can result in row versions created by the prepared
> transaction being invisible even after the prepared transaction commits.
>
> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

I'm investigating somewhat resemblant case.
We have an OLTP sharded installation where shards are almost always under rebalancing. Data movement is implemented with 2pc.
Switchover happens quite often due to datacenter drills. The installation is running on PostgreSQL 12.6.

In January heapcheck of backup reported some suspicious problems
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470241, offnum(tuple)=1
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470242, offnum(tuple)=1
...
and so on for ~100 pages - tuples with lp==1 were not frozen.

We froze tuples with pg_dirty_hands and run VACUUM (DSIABLE_PAGE_SKIPPING) on the table.

In the end of the March the same shard stroke again with:
ERROR: Page marked as all-frozen, but found non frozen tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
....
around ~1040 blocks (starting from the same 1470240!) had non-frozen tuple at lp==42.
I've run
update s3.objects_65 set created = created where ctid = '(1470241,42)' returning *;

After that heapcheck showed VM problem
ERROR: XX001: Found non all-visible tuple. Oid(relation)=18487, blkno(page)=1470240, offnum(tuple)=42
LOCATION: collect_corrupt_items, heap_check.c:186

VACUUM fixed it with warnings.
WARNING: 01000: page is not marked all-visible but visibility map bit is set in relation "objects_65" page 1470240
and failed on next page
ERROR: XX001: found xmin 1650436694 from before relfrozenxid 1752174172
LOCATION: heap_prepare_freeze_tuple, heapam.c:6172

I run update from all tuples in heapcheks ctid list and subsequent vacuum (without page skipping). This satisfied corruption monitoring.

Can this case be related to the problem that you described?

Or, perhaps, it looks more like a hardware problem? Data_checksums are on, but few years ago we observed ssd firmware that was loosing updates, but passing checksums. I'm sure that we would benefit from having separate relation fork for checksums or LSNs.

We observe similar cases 3-5 times a year. To the date no data was lost due to this, but it's somewhat annoying.
BTW I'd say that such things are an argument for back-porting pg_surgery and heapcheck to old versions.

Thanks!

Best regards, Andrey Borodin.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2021-05-01 13:55:15 Re: Enhanced error message to include hint messages for redundant options error
Previous Message Julien Rouhaud 2021-05-01 07:24:58 Hook for extensible parsing.