Re: Why don't update minimum recovery point in xact_redo_abort

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: 蔡梦娟(玊于) <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Why don't update minimum recovery point in xact_redo_abort
Date: 2021-07-27 08:26:05
Message-ID: b2385710-92d0-732f-46ed-d3585b3bafd6@oss.nttdata.com
Lists: pgsql-hackers

On 2021/07/27 2:38, 蔡梦娟(玊于) wrote:
> Hi, all
>
> Recently, I got a PANIC while restarting a standby, which can be reproduced by the following steps, based on PG 11:
> 1. begin a transaction on the primary node;
> 2. create a table in the transaction;
> 3. insert lots of data into the table;
> 4. do a checkpoint, and restart the standby after the checkpoint is done on the primary node;
> 5. insert/update lots of data in the table again;
> 6. abort the transaction.

I was able to reproduce the issue on the master branch by using
similar steps with full_page_writes disabled.

>
> After step 6, do a fast shutdown of the standby node and then restart it; you will get a PANIC log, and the backtrace is:
> #0  0x00007fc663e5a277 in raise () from /lib64/libc.so.6
> #1  0x00007fc663e5b968 in abort () from /lib64/libc.so.6
> #2  0x0000000000c89f01 in errfinish (dummy=0) at elog.c:707
> #3  0x0000000000c8cba3 in elog_finish (elevel=22, fmt=0xdccc18 "WAL contains references to invalid pages") at elog.c:1658
> #4  0x00000000005e476a in XLogCheckInvalidPages () at xlogutils.c:253
> #5  0x00000000005cbc1a in CheckRecoveryConsistency () at xlog.c:9477
> #6  0x00000000005ca5c5 in StartupXLOG () at xlog.c:8609
> #7  0x0000000000a025a5 in StartupProcessMain () at startup.c:274
> #8  0x0000000000643a5c in AuxiliaryProcessMain (argc=2, argv=0x7ffe4e4849a0) at bootstrap.c:485
> #9  0x0000000000a00620 in StartChildProcess (type=StartupProcess) at postmaster.c:6215
> #10 0x00000000009f92c6 in PostmasterMain (argc=3, argv=0x4126500) at postmaster.c:1506
> #11 0x00000000008eab64 in main (argc=3, argv=0x4126500) at main.c:232
>
> I think the reason for the above error is as follows:
> 1. the transaction on the primary node was eventually aborted, and the standby node also deleted the table files when it replayed the abort xlog record, but without updating the minimum recovery point;
> 2. the primary node did a checkpoint before the abort, and the standby node was then restarted, so the standby node will recover from a point where the table has already been created and data has been inserted into it;
> 3. when the standby node restarts after step 6, it will find that a page needed during recovery doesn't exist, because it was already deleted by xact_redo_abort earlier, so the standby node will treat this page as an invalid page;
> 4. because xact_redo_abort drops relation files without updating the minimum recovery point, the standby node will reach consistency before it replays the abort xlog record and forgets the invalid pages again, since the abort xlog record's LSN is greater than minRecoveryPoint;
> 5. during CheckRecoveryConsistency, it will check for invalid pages, find that there is an invalid page, and generate the PANIC log.
>
> So why not update the minimum recovery point in xact_redo_abort, just as xact_redo_commit does via XLogFlush? That way the standby would reach consistency and check invalid pages only after it has replayed the abort xlog record.

ISTM that you're right. xact_redo_abort() should call XLogFlush() to
update the minimum recovery point on truncation. This seems to be
an oversight in commit 7bffc9b7bf.
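
To illustrate the sequence discussed above, here is a toy model in C.
It is only a sketch and not PostgreSQL source: the Standby struct, the
replay() and simulate_restart() functions, and the LSN values (100,
110, 120, 130) are all hypothetical, and the consistency check is
simplified to "run after every record whose LSN has reached the
minimum recovery point". The flag abort_updates_min_recovery stands in
for whether the abort redo path advances the minimum recovery point
before dropping the relation files.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical WAL record kinds for this toy model. */
typedef enum { REC_INSERT, REC_ABORT_DROP } RecType;

typedef struct {
    unsigned long lsn;   /* made-up LSN of the record */
    RecType       type;
} WalRecord;

typedef struct {
    bool          rel_exists;    /* relation file present on disk */
    unsigned long min_recovery;  /* persisted minimum recovery point */
} Standby;

/*
 * Replay all records from the redo point.  Returns false if the
 * consistency check finds an outstanding invalid page (the PANIC
 * case), true otherwise.
 */
static bool
replay(Standby *s, const WalRecord *recs, int n,
       bool abort_updates_min_recovery)
{
    int invalid_pages = 0;

    for (int i = 0; i < n; i++) {
        const WalRecord *r = &recs[i];

        if (r->type == REC_INSERT) {
            /* A page of a missing relation is remembered as invalid. */
            if (!s->rel_exists)
                invalid_pages++;
        } else {
            /* Assumed fix: advance the minimum recovery point first. */
            if (abort_updates_min_recovery && r->lsn > s->min_recovery)
                s->min_recovery = r->lsn;
            s->rel_exists = false;  /* drop the relation files */
            invalid_pages = 0;      /* forget its invalid pages */
        }

        /* Simplified consistency check once min_recovery is reached. */
        if (r->lsn >= s->min_recovery && invalid_pages > 0)
            return false;
    }
    return true;
}

/* First replay pass, then a restart that replays from the same redo point. */
static bool
simulate_restart(bool abort_updates_min_recovery)
{
    const WalRecord recs[] = {
        {110, REC_INSERT}, {120, REC_INSERT}, {130, REC_ABORT_DROP}
    };
    Standby s = { true, 100 };  /* table exists at the redo point */
    bool first_pass = replay(&s, recs, 3, abort_updates_min_recovery);

    /* The first pass cannot fail: the relation file still exists. */
    assert(first_pass);
    (void) first_pass;

    /* After restart the file is gone, but min_recovery is as persisted. */
    return replay(&s, recs, 3, abort_updates_min_recovery);
}
```

In this model, simulate_restart(false) returns false, because the
restarted standby reaches its stale minimum recovery point while an
invalid page is still outstanding. simulate_restart(true) returns
true, because the check is deferred until the abort record has been
replayed and the invalid pages have been forgotten.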

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
