Re: Allow users to choose what happens when recovery target is not reached

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "Bharath Rupireddy" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "Julien Rouhaud" <rjuju123(at)gmail(dot)com>
Cc: "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Allow users to choose what happens when recovery target is not reached
Date: 2021-11-13 15:45:31
Message-ID: 42f7e161-cbcb-42d8-acc9-3049f2275982@www.fastmail.com
Lists: pgsql-hackers

On Sat, Nov 13, 2021, at 10:15 AM, Bharath Rupireddy wrote:
> Firstly, the proposed patch adds no new behaviour as such, it just
> gives the ability that is existing today on v12 and below (prior to
> commit dc78866 which went into v13 and later).
It reintroduces an awkward behavior [1].

> I think performing PITR is the user's wish - whether the primary is
> available or not, it is completely the user's choice. The user might
> start the PITR, when the primary is available, thinking that it sends
> all the WAL files required for achieving recovery target. But imagine
> a disaster happens and the primary server crashes, say the recovery
> has replayed a huge bunch of WAL records (a TB may be), and the
> primary failed without sending the last one or few WAL files, should
> the PITR target server be failing this case after replaying a huge
> bunch of WAL records? The user might want the target server to be
> available instead of FATALly shutting down. This is the exact problem
> the proposed patch is trying to solve.
Are you archiving on the primary server? You are risking your customer's
business by suggesting such a setup. You should store the WAL files on your
backup server.

It seems your setup has a flaw. You set a recovery target but then accept an
outcome that is not what you initially asked for. If it is a real PITR, that is
awkward, as Peter [1] said. You could validate your recovery settings by
checking the timestamp of the last WAL file as a rough approximation of the
maximum recovery target time. The other option is to run pg_waldump to obtain
the last commit timestamp.
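
To illustrate the two checks, here is a minimal sketch rather than a tested
tool. It assumes the archive is a plain directory of uncompressed segments named
with 24 hexadecimal characters, and that pg_waldump prints commit records as
"desc: COMMIT <timestamp>". The function names (newest_wal_mtime,
target_looks_reachable, last_commit_timestamp) are hypothetical helpers, not
part of PostgreSQL.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Assumption: archived segments keep their plain 24-hex-character names
# (no .gz or .partial suffix).
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")

# Assumption: commit records in pg_waldump output look like
#   ... desc: COMMIT 2021-11-13 12:00:00.123456 UTC
COMMIT_RE = re.compile(
    r"desc: COMMIT (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)"
)


def newest_wal_mtime(archive_dir: str) -> Optional[datetime]:
    """Return the latest modification time (UTC) among WAL segments, or None."""
    mtimes = [
        p.stat().st_mtime
        for p in Path(archive_dir).iterdir()
        if p.is_file() and WAL_SEGMENT_RE.match(p.name)
    ]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)


def target_looks_reachable(target: datetime, archive_dir: str) -> bool:
    """Rough check only: file mtime is an upper bound, not the last commit time.

    `target` must be a timezone-aware datetime.
    """
    newest = newest_wal_mtime(archive_dir)
    return newest is not None and target <= newest


def last_commit_timestamp(waldump_output: str) -> Optional[str]:
    """Return the timestamp text of the last COMMIT record in pg_waldump output."""
    matches = COMMIT_RE.findall(waldump_output)
    return matches[-1] if matches else None
```

The mtime check can only tell you that a target is clearly out of range; a
target that passes it may still lie after the last commit in the archive, which
is why the pg_waldump output is the more precise of the two.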

If you care about your customer's data, you won't use such an option. Otherwise,
I repeat Julien's question [2]: isn't it better to simply not specify a target
and let the recovery go as far as possible?

> As I said earlier, the behaviour is not too dangerous as it is not
> something new that the patch is proposing, it exists today in v12 and
> below. In fact, it gives a way out of a "dangerous situation" if the
> user ever gets stuck in it without wasting recovery cycles and compute
> resources, by quickly getting the database to be available(of course,
> the responsibility lies with the user to deal with the missing WAL
> files).
Your proposal seems to leave the user shooting in the dark. If a FATAL message
was raised, it means the user missed the target. Even after that, the user can
accept the situation, remove the target parameters and start the server again.
I think promote or even pause might lead to incorrect expectations (if the user
doesn't carefully inspect the log messages).

A disadvantage of this proposal shows up if you have it set to 'promote' and
start the recovery, and the server gets promoted before reaching the target.
While inspecting your server configuration, you realize that you were pointing
to the incorrect archive or that the WAL files were not available in time (due
to timing issues). You have no option but to start from scratch.

[1] https://postgr.es/m/234a0c50-1160-86c2-4e4b-35e9684f1799%402ndquadrant.com
[2] https://postgr.es/m/CAOBaU_ZDkyoQvEsYT0-p1Hb0m_nGtQJ4tTGm2-Ay6v%3DTCjmsWg%40mail.gmail.com

--
Euler Taveira
EDB https://www.enterprisedb.com/
