Check invalid pages at the end of recovery to alarm lost data

From: 王伟(学弈) <rogers(dot)ww(at)alibaba-inc(dot)com>
To: "pgsql-hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Check invalid pages at the end of recovery to alarm lost data
Date: 2023-07-10 07:53:13
Message-ID: 26f24fc1-02a4-464c-87ac-ac52f76baa56.rogers.ww@alibaba-inc.com
Lists: pgsql-hackers

Hello all,
I recently found a strange situation in which the primary node loses data. The
details are in the first patch: 0001-Add-test-case-data-lost-after-restart.patch.

The first patch shows that data can be lost if someone else truncates a physical
relation file before the primary node starts up. The primary node nevertheless
starts up normally without any alarm, even though it finds invalid pages
during crash recovery.

I then found another situation, involving an aborted transaction. The details are
in the second patch: 0002-Add-test-case-for-abort-transaction-across-checkpoin.patch.

The second patch shows that a transaction abort that spans a checkpoint can also
cause invalid pages, and that crash recovery can leave some relation files undeleted
forever. Again, the primary node starts up normally without any alarm, just as in
the first situation.

By the way, both experiments above were run with the following parameters
set:
$node_primary->append_conf('postgresql.conf', 'synchronous_commit=on');
$node_primary->append_conf('postgresql.conf', 'full_page_writes=off');
$node_primary->append_conf('postgresql.conf', 'log_min_messages=debug2');

In my opinion, the primary node should raise an alarm about invalid pages found
during crash recovery, just as a standby node does after reaching a consistent
recovery state. So I have attached a third patch, a bug fix,
0003-Check-invalid-pages-at-the-end-of-recovery.patch, which does the following two things:
(1) The primary node checks for invalid pages at the end of recovery;
(2) The abort WAL record is flushed before any relation files are truncated or deleted.
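
To illustrate item (1), here is a minimal sketch of the idea in C. It is a toy model, not the code in the patch and not PostgreSQL's implementation: every type and function name below is hypothetical, and the fixed-size table is only there to keep the example short. Replay remembers each reference to a block that is missing on disk, a later truncate or drop record "explains" (forgets) such references, and whatever is still remembered at the end of recovery is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: these types and names are hypothetical, not PostgreSQL's. */
typedef struct {
    unsigned rel;    /* made-up relation id */
    unsigned blkno;  /* block number referenced by a WAL record */
    bool in_use;
} InvalidPage;

#define MAX_INVALID 64
InvalidPage invalid_pages[MAX_INVALID];

/* Clear the table, as at the start of a recovery run. */
void reset_invalid_pages(void)
{
    for (int i = 0; i < MAX_INVALID; i++)
        invalid_pages[i].in_use = false;
}

/* Record a reference to a block missing on disk (duplicates are ignored). */
void remember_invalid_page(unsigned rel, unsigned blkno)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_INVALID; i++) {
        if (!invalid_pages[i].in_use) {
            if (free_slot < 0)
                free_slot = i;
        } else if (invalid_pages[i].rel == rel &&
                   invalid_pages[i].blkno == blkno) {
            return;
        }
    }
    assert(free_slot >= 0);  /* the toy table has a fixed capacity */
    invalid_pages[free_slot] = (InvalidPage){rel, blkno, true};
}

/* A replayed truncate/drop record explains missing blocks >= first_removed. */
void forget_invalid_pages(unsigned rel, unsigned first_removed)
{
    for (int i = 0; i < MAX_INVALID; i++)
        if (invalid_pages[i].in_use && invalid_pages[i].rel == rel &&
            invalid_pages[i].blkno >= first_removed)
            invalid_pages[i].in_use = false;
}

/* Replay one block reference against a file with nblocks blocks on disk. */
void replay_block_ref(unsigned rel, unsigned blkno, unsigned nblocks)
{
    if (blkno >= nblocks)
        remember_invalid_page(rel, blkno);
}

/* End-of-recovery check: warn about and count unexplained references. */
int check_invalid_pages(void)
{
    int remaining = 0;
    for (int i = 0; i < MAX_INVALID; i++)
        if (invalid_pages[i].in_use) {
            fprintf(stderr, "WARNING: relation %u block %u is invalid\n",
                    invalid_pages[i].rel, invalid_pages[i].blkno);
            remaining++;
        }
    return remaining;
}

/* Demo: relation 1 was truncated externally; relation 2 is dropped in WAL. */
int run_demo(void)
{
    reset_invalid_pages();
    replay_block_ref(1, 3, 2);   /* file has 2 blocks, WAL touches block 3 */
    replay_block_ref(2, 0, 0);   /* file is gone, WAL touches block 0 */
    forget_invalid_pages(2, 0);  /* a later drop record explains relation 2 */
    return check_invalid_pages();
}
```

In this sketch `run_demo()` returns 1: the reference to relation 2 is explained by the later drop record, while the reference to the externally truncated relation 1 is never explained, which is the kind of case where an alarm at the end of recovery would help.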

Best wishes,
rogers.ww.

Attachment                                                       Content-Type              Size
0001-Add-test-case-data-lost-after-restart.patch                 application/octet-stream  2.8 KB
0002-Add-test-case-for-abort-transaction-across-checkpoin.patch  application/octet-stream  3.4 KB
0003-Check-invalid-pages-at-the-end-of-recovery.patch            application/octet-stream  2.4 KB
