Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, arorasam(at)gmail(dot)com
Subject: Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"
Date: 2021-10-23 04:01:41
Message-ID: CALj2ACWphKWBJUPhtddjcRRqtE7YZh+65hTM_htrBzrZ87QXPg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> > If the suggestion is to have the wait and retry logic embedded into
> > the user-written restore_command, IMHO, it's not a good idea as the
> > restore_command is external to the core PG and the FATAL error
> > "recovery ended before configured recovery target was reached" is an
> > internal thing.
>
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.

If the suggestion is to give users more control over what happens to
the standby once the timeout expires, then two new GUCs,
recovery_target_retry_timeout (int) and
recovery_target_continue_after_timeout (bool), would let them choose
the behavior they want. I'm not sure whether adding two new GUCs is
acceptable. Let's hear what other hackers think.
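
To make the intended semantics concrete, here is a rough Python model
of how the two proposed settings could interact. It is only a sketch
under assumptions: neither GUC exists in PostgreSQL, the function and
exception names are hypothetical, and the real logic would live in the
server's C recovery code rather than in a script.

```python
import time
from typing import Callable


class RecoveryTargetNotReached(Exception):
    """Models the FATAL error
    'recovery ended before configured recovery target was reached'."""


def finish_recovery(
    target_reached: Callable[[], bool],
    retry_timeout: float,            # models proposed recovery_target_retry_timeout
    continue_after_timeout: bool,    # models proposed recovery_target_continue_after_timeout
    poll_interval: float = 1.0,      # hypothetical; not part of the proposal
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry until the target is reached or the timeout expires."""
    deadline = clock() + retry_timeout
    while True:
        if target_reached():
            return "target reached"
        if clock() >= deadline:
            break
        # Give the archive a chance to deliver more WAL, then retry.
        sleep(poll_interval)

    if continue_after_timeout:
        # Downgrade the failure to a warning and let recovery end.
        return "WARNING: recovery target not reached, continuing"
    raise RecoveryTargetNotReached(
        "recovery ended before configured recovery target was reached"
    )
```

With continue_after_timeout set to false the model keeps today's
behavior, only delayed by the retry window; with it set to true the
timeout ends in a warning instead of a failure.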

> I think it's an interesting point that it can be hard to choose a
> reasonable recovery target if the system is completely down. We could
> use some better tooling or metadata around the lsns, xids or timestamp
> ranges available in a pg_wal directory or an archive. Even better would
> be to see the available named restore points. This would make it easier
> to calculate how long recovery might take for a given restore point, or
> whether it's not going to work at all because there's not enough WAL.

I think pg_waldump can help here with exploratory analysis of the WAL
available in a given directory. Since it is a standalone C program, it
can be run while the server is down, and it can also be pointed at an
archive location.
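
As an illustration of that kind of exploratory analysis, the following
Python sketch summarizes pg_waldump's text output: the lowest and
highest record LSNs seen, plus any named restore points. It assumes
the usual per-record line layout ("... lsn: X/Y, prev X/Y, desc: ...")
and that restore point records are described as "RESTORE_POINT <name>";
the function names are hypothetical, and the layout should be checked
against the pg_waldump version in use.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed layout of one pg_waldump record line (text output).
RECORD_RE = re.compile(
    r"lsn: (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+), "
    r"prev [0-9A-Fa-f]+/[0-9A-Fa-f]+, desc: (?P<desc>.*)$"
)


def lsn_to_int(lsn: str) -> int:
    """Convert an 'X/Y' LSN into a 64-bit integer for comparison."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def summarize_wal(
    lines: Iterable[str],
) -> Tuple[Optional[str], Optional[str], List[Tuple[str, str]]]:
    """Return (first_lsn, last_lsn, [(restore_point_name, lsn), ...])."""
    first: Optional[str] = None
    last: Optional[str] = None
    restore_points: List[Tuple[str, str]] = []
    for line in lines:
        match = RECORD_RE.search(line)
        if match is None:
            continue  # skip anything that is not a record line
        lsn = match.group("lsn")
        if first is None or lsn_to_int(lsn) < lsn_to_int(first):
            first = lsn
        if last is None or lsn_to_int(lsn) > lsn_to_int(last):
            last = lsn
        desc = match.group("desc")
        if desc.startswith("RESTORE_POINT "):
            name = desc.split(None, 1)[1].strip()
            restore_points.append((name, lsn))
    return first, last, restore_points
```

The function only parses text, so it can be fed the saved output of a
pg_waldump run over a pg_wal directory or a copy of archived segments,
without a running server.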

Regards,
Bharath Rupireddy.
