Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, arorasam(at)gmail(dot)com
Subject: Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"
Date: 2021-10-22 10:04:36
Message-ID: CALj2ACXbkQE=s+mccU=4Rcg3vgTQ4QfDNsnWN=wgMHodC-FNfQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 22, 2021 at 5:54 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> > The FATAL error "recovery ended before configured recovery target
> > was reached" introduced by commit at [1] in PG 14 is causing the
> > standby to go down after having spent a good amount of time in
> > recovery. There can be cases where the arrival of required WAL (for
> > reaching recovery target) from the archive location to the standby
> > may take time and meanwhile the standby failing with the FATAL error
> > isn't good. Instead, how about we make the standby wait for a certain
> > amount of time (with a GUC) so that it can keep looking for the
> > required WAL.
>
> How is archiving configured, and would it be possible to introduce
> logic into the restore_command to handle slow-to-arrive WAL?

Thanks Jeff!

If the suggestion is to embed the wait-and-retry logic in the
user-written restore_command, IMHO that's not a good idea: the
restore_command is external to core PG, whereas the FATAL error
"recovery ended before configured recovery target was reached" is
raised internally. Having the retry logic (controlled by a GUC) in
core, applied when the startup process reaches the end of recovery
before the target, is a better approach and is something core PG can
offer. With this, the work the standby has already spent in recovery
isn't wasted, provided the GUC is set to a suitable value. A good
starting value is the average time it takes for WAL to travel from the
primary to the archive location plus the time from the archive
location to the standby. The new GUC can default to 0 (disabled), so
that only those who want the behavior need to set it.
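
To make the idea a bit more concrete, here is a rough, standalone
sketch of the retry loop I have in mind. It is not a patch and does not
use any real PostgreSQL internals: wait_for_more_wal(), the two
callback types, the retry interval and the fake_archive test double
are all hypothetical names made up for illustration, and
retry_timeout_ms stands in for the proposed (not yet existing) GUC. A
timeout of 0 means "try once and give up", which matches the
disabled-by-default behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once more WAL has become available. */
typedef bool (*wal_fetch_fn) (void *arg);

/* Hypothetical callback: sleeps for the given number of milliseconds. */
typedef void (*nap_fn) (int ms, void *arg);

/*
 * Keep looking for WAL until it shows up or retry_timeout_ms has elapsed.
 * Returns true if WAL arrived, false if the caller should give up (which
 * is where the existing FATAL error would be raised).
 *
 * retry_timeout_ms == 0 disables retrying: exactly one attempt is made.
 */
static bool
wait_for_more_wal(int retry_timeout_ms, int retry_interval_ms,
                  wal_fetch_fn fetch, nap_fn nap, void *arg)
{
    int waited_ms = 0;

    assert(fetch != NULL && nap != NULL && retry_interval_ms > 0);

    for (;;)
    {
        int nap_ms;

        if (fetch(arg))
            return true;        /* WAL arrived, recovery can continue */

        if (waited_ms >= retry_timeout_ms)
            return false;       /* budget exhausted, give up */

        /* Never sleep past the configured timeout. */
        nap_ms = retry_timeout_ms - waited_ms;
        if (nap_ms > retry_interval_ms)
            nap_ms = retry_interval_ms;

        nap(nap_ms, arg);
        waited_ms += nap_ms;
    }
}

/* Test double: an archive where the WAL file appears at arrives_at_ms. */
struct fake_archive
{
    int arrives_at_ms;
    int now_ms;
    int attempts;
};

static bool
fake_fetch(void *arg)
{
    struct fake_archive *a = arg;

    a->attempts++;
    return a->now_ms >= a->arrives_at_ms;
}

static void
fake_nap(int ms, void *arg)
{
    struct fake_archive *a = arg;

    a->now_ms += ms;            /* advance the simulated clock */
}
```

With the fake archive above, WAL that shows up 3 seconds late is picked
up when the timeout is 5 seconds, while a timeout of 0 falls through to
the error path after a single attempt, i.e. today's behavior.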

Regards,
Bharath Rupireddy.
