add retry mechanism for achieving recovery target before emitting FATAL error "recovery ended before configured recovery target was reached"

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: arorasam(at)gmail(dot)com
Subject: add retry mechanism for achieving recovery target before emitting FATAL error "recovery ended before configured recovery target was reached"
Date: 2021-10-20 16:05:44
Message-ID: CALj2ACULyUY_GgCf-MSZQUsvD_Fk_F+79qz0F53b2f_KdugZhA@mail.gmail.com
Lists: pgsql-hackers

Hi,

The FATAL error "recovery ended before configured recovery target was
reached", introduced in PG 14 by commit [1], causes the standby to go
down after it has already spent a good amount of time in recovery. The
WAL required to reach the recovery target can take a while to arrive
at the standby from the archive location, and having the standby fail
with the FATAL error in the meantime isn't good. Instead, how about we
make the standby wait for a configurable amount of time (set with a
new GUC) so that it can keep looking for the required WAL? If the WAL
arrives during the wait, the standby reaches the recovery target and
there is no FATAL error. If it doesn't, the timeout expires and the
standby fails with the FATAL error. The new GUC could be set to
roughly the average time it takes WAL to travel from the primary to
the archive location plus the time from the archive location to the
standby. The default would be 0, i.e. the retry is disabled.
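
To make the proposed behavior concrete, here is a minimal standalone C
sketch of the wait loop. It is not the attached patch and it does not
use any PostgreSQL internals. Every name in it (wait_for_target_wal,
wal_probe_fn, SimClock, FakeArchive) is hypothetical, and a simulated
clock stands in for real sleeping so that the behavior is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback: returns true once the WAL needed to reach
 * the recovery target is available. */
typedef bool (*wal_probe_fn) (void *arg);

/* Hypothetical simulated clock; a real server would sleep instead. */
typedef struct SimClock
{
    long        now_ms;
} SimClock;

static void
sim_sleep(SimClock *clock, long ms)
{
    clock->now_ms += ms;
}

/*
 * Keep probing for the required WAL until it shows up or timeout_ms
 * has elapsed. Returns true if the WAL arrived in time, false on
 * timeout (where the caller would raise the FATAL error).
 * timeout_ms == 0 means "disabled": the probe is tried exactly once.
 */
static bool
wait_for_target_wal(SimClock *clock, long timeout_ms, long poll_ms,
                    wal_probe_fn probe, void *arg)
{
    long        deadline;

    assert(timeout_ms >= 0 && poll_ms > 0);
    deadline = clock->now_ms + timeout_ms;

    for (;;)
    {
        if (probe(arg))
            return true;        /* required WAL found */
        if (clock->now_ms >= deadline)
            return false;       /* timed out, give up */
        sim_sleep(clock, poll_ms);
    }
}

/* Example probe: the WAL "arrives" after a fixed number of polls. */
typedef struct FakeArchive
{
    int         polls_until_ready;
} FakeArchive;

static bool
fake_probe(void *arg)
{
    FakeArchive *archive = (FakeArchive *) arg;

    if (archive->polls_until_ready <= 0)
        return true;
    archive->polls_until_ready--;
    return false;
}
```

With a timeout of 0 the loop degenerates to a single probe, which
matches the current behavior of failing right away when the WAL is
missing.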

I'm attaching a WIP patch. I've tested it on my dev system and the
recovery regression tests are passing with it. I will provide a better
version later, probably with a test case.

Thoughts?

[1] commit dc788668bb269b10a108e87d14fefd1b9301b793

Author: Peter Eisentraut <peter(at)eisentraut(dot)org>
Date: Wed Jan 29 15:43:32 2020 +0100

Fail if recovery target is not reached

Before, if a recovery target is configured, but the archive ended
before the target was reached, recovery would end and the server would
promote without further notice. That was deemed to be pretty wrong.
With this change, if the recovery target is not reached, it is a fatal
error.

Based-on-patch-by: Leif Gunnar Erlandsen <leif(at)lako(dot)no>
Reviewed-by: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Discussion:
https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5(at)lako(dot)no

Regards,
Bharath Rupireddy.

Attachment Content-Type Size
v1-0001-add-retry-mechanism-with-a-GUC-before-failing-the.patch application/octet-stream 4.7 KB
