Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: RKN Sai Krishna <rknsaiforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, mahendrakarforpg(at)gmail(dot)com, marvin_liang(at)qq(dot)com, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, rohanforpostgres(at)gmail(dot)com
Subject: Re: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication
Date: 2022-06-22 07:17:35
Message-ID: CAD21AoCrtO8gBORbHVKu3ju9hS953xwWnc1HauwXSejOWji75Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Jun 22, 2022 at 2:44 PM RKN Sai Krishna
<rknsaiforpostgres(at)gmail(dot)com> wrote:
>
> Hi,
>
> Problem: Today when a data page is corrupted in the primary postgres with physical replication (sync or async standbys), there seems to be no way to repair it easily and we rely on PITR to recreate the postgres server or drop the corrupted table (of course this is not an option for important customer tables, but may be okay for some maintenance or temporary tables). PITR is costly to do in a production environment oftentimes as it involves creation of the full-blown postgres from the base backup and causing downtime for the customers.
>
> Solution: Make use of the uncorrupted page present in sync or async standby. The proposed tool/extension pg_page_repair (as we call it) can fetch the uncorrupted page from sync or async standby and overwrite the corrupted one on the primary. Yes, there will be a challenge in making sure that the WAL is replayed completely and standby is up-to-date so that we are sure that stale pages are not copied across. A simpler idea could be that the pg_page_repair can wait until the standby replays/catches up with the primary's flush LSN before fetching the uncorrupted page. A downside of this approach is that the pg_page_repair waits for long or rather infinitely if the replication lag is huge. As we all know that the replication lag is something a good postgres solution will always monitor to keep it low, if true, the pg_page_repair is guaranteed to not wait for longer. Another idea could be that the pg_page_repair gets the base page from the standby and applies all the WAL records pertaining to the corrupted page using the base page to get the uncorrupted page. This requires us to pull the replay logic from the core to pg_page_repair which isn't easy. Hence we propose to go with approach 1, but open to discuss on approach 2 as well. We suppose that the solution proposed in this thread holds good even for pages corresponding to indexes.

I'm interested in this topic and recalled I did some research on the
first idea while writing experimental code several years ago[1].

The corruption that can be fixed by this feature is mainly physical
corruption, for example, introduced by storage array cache corruption,
array firmware bugs, filesystem bugs, is that right? Logically corrupt
blocks are much more likely to have been introduced as a result of a
failure or a bug in PostgreSQL, which would end up propagating to
physical standbys.

>
> Implementation Choices: pg_page_repair can either take the corrupted page info (db id, rel id, block number etc.) or just a relation name and automatically figure out the corrupted page using pg_checksums for instance or just database name and automatically figure out all the corrupted pages. It can either repair the corrupted pages online (only the corrupted table is inaccessible, the server continues to run) or take downtime if there are many corrupted pages.

Since the server must be shutdown cleanly before running pg_checksums
if we want to verify checksums of the page while the server is running
we would need to do online checksum verification we discussed
before[2].

>
> Future Scope: pg_page_repair can be integrated to the core so that the postgres will repair the pages automatically without manual intervention.
>
> Other Solutions: We did consider an approach where the tool could obtain the FPI from WAL and replay till the latest WAL record to repair the page. But there could be limitations such as FPI and related WAL not being available in primary/archive location.

How do we find the FPI of the corrupted page effectively from WAL? We
could seek WAL records from backward but it could take a quite long
time.

Regards,

[1] https://github.com/MasahikoSawada/pgtools/tree/master/page_repair
[2] https://www.postgresql.org/message-id/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2022-06-22 07:21:44 Re: Postgres perl module namespace
Previous Message Peter Smith 2022-06-22 06:46:15 Re: Handle infinite recursion in logical replication setup