pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication

From: RKN Sai Krishna <rknsaiforpostgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: mahendrakarforpg(at)gmail(dot)com, marvin_liang(at)qq(dot)com, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, rohanforpostgres(at)gmail(dot)com
Subject: pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication
Date: 2022-06-22 05:44:34
Message-ID: CAMVpbFNqjXWMt3A4YnA0YggsJRikctBRS9Bo_kJNdZON1VTujw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Problem: Today when a data page is corrupted in the primary postgres with
physical replication (sync or async standbys), there seems to be no way to
repair it easily and we rely on PITR to recreate the postgres server or
drop the corrupted table (of course this is not an option for important
customer tables, but may be okay for some maintenance or temporary tables).
PITR is costly to do in a production environment oftentimes as it involves
creation of the full-blown postgres from the base backup and causing
downtime for the customers.

Solution: Make use of the uncorrupted page present in sync or async
standby. The proposed tool/extension pg_page_repair (as we call it) can
fetch the uncorrupted page from sync or async standby and overwrite the
corrupted one on the primary. Yes, there will be a challenge in making sure
that the WAL is replayed completely and standby is up-to-date so that we
are sure that stale pages are not copied across. A simpler idea could be
that the pg_page_repair can wait until the standby replays/catches up with
the primary's flush LSN before fetching the uncorrupted page. A downside of
this approach is that the pg_page_repair waits for long or rather
infinitely if the replication lag is huge. As we all know that the
replication lag is something a good postgres solution will always monitor
to keep it low, if true, the pg_page_repair is guaranteed to not wait for
longer. Another idea could be that the pg_page_repair gets the base page
from the standby and applies all the WAL records pertaining to the
corrupted page using the base page to get the uncorrupted page. This
requires us to pull the replay logic from the core to pg_page_repair which
isn't easy. Hence we propose to go with approach 1, but open to discuss on
approach 2 as well. We suppose that the solution proposed in this thread
holds good even for pages corresponding to indexes.

Implementation Choices: pg_page_repair can either take the corrupted page
info (db id, rel id, block number etc.) or just a relation name and
automatically figure out the corrupted page using pg_checksums for instance
or just database name and automatically figure out all the corrupted pages.
It can either repair the corrupted pages online (only the corrupted table
is inaccessible, the server continues to run) or take downtime if there are
many corrupted pages.

Future Scope: pg_page_repair can be integrated to the core so that the
postgres will repair the pages automatically without manual intervention.

Other Solutions: We did consider an approach where the tool could obtain
the FPI from WAL and replay till the latest WAL record to repair the page.
But there could be limitations such as FPI and related WAL not being
available in primary/archive location.

Thoughts?

Credits (cc-ed): thanks to SATYANARAYANA NARLAPURAM for initial thoughts
and thanks to Bharath Rupireddy, Chen Liang, mahendrakar s and Rohan Kumar
for internal discussions.

Thanks, RKN

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Drouvot, Bertrand 2022-06-22 06:29:03 Missing reference to pgstat_replslot.c in pgstat.c
Previous Message Masahiko Sawada 2022-06-22 05:38:36 Re: Support logical replication of DDLs