Re: Reading timeline from pg_control on replication slave

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reading timeline from pg_control on replication slave
Date: 2017-10-27 22:09:34
Message-ID: CAB7nPqSA5jFch+_kYNV1cu2WpJ1NGysx48pbBh+7M7kkTLNWCw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 27, 2017 at 1:04 AM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> I'm working on backups from a replication slave in WAL-G [0].
> Backups used to use the result of pg_walfile_name(pg_start_backup(...)). The call to pg_start_backup() works nicely, but "pg_walfile_name() cannot be executed during recovery."
> This function takes an LSN as its argument and reads the TimeLineID from global state.
> So I made a function [1] that, if on a replica, reads the timeline from the pg_control file and formats the WAL file name as if it were produced by pg_walfile_name(lsn).

ThisTimeLineID is not something you can rely on for standby backends
as it is not set during recovery. That's the reason why
pg_walfile_name is disabled during recovery. There are three things
coming to the top of my mind that one could think about:
1) Backups cannot be completed when started on a standby in recovery
and when stopped after the standby has been promoted, meaning that its
timeline has changed.
2) After a standby has been promoted, by using pg_start_backup, you
issue a checkpoint which makes sure that the control file gets flushed
with the new information, so when pg_start_backup returns to the
caller you should have the correct timeline number because the outer
function gets evaluated last.
3) Backups taken from cascading standbys, where a direct parent has
been promoted.

1) and 2) are actually not problems per the restrictions I am giving
above, but 3) is. If I recall correctly, when a streaming standby does
a timeline jump, a restart point is not immediately generated, so the
timeline in the control file may not yet be updated to the latest
value, meaning that the WAL file name you build here could refer to a
previous timeline and not the newest one.
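
To make the effect of a stale timeline concrete, here is a minimal sketch of
how a WAL file name is composed from a timeline ID and an LSN. It is not
WAL-G's code and not the server's implementation: the helper names
parse_lsn and wal_file_name are hypothetical, the 16 MB segment size is an
assumed default, and the sketch uses plain integer division, so its result
may differ from the server's for an LSN that falls exactly on a segment
boundary.

```python
# Assumption: default WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'XXXXXXXX/XXXXXXXX' (hex) to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Format the 24-character WAL file name: timeline, log id, segment."""
    segments_per_xlogid = 0x100000000 // segment_size
    segno = parse_lsn(lsn) // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


# Same LSN, two candidate timelines: only the first 8 hex digits differ.
stale = wal_file_name(1, "0/2000028")    # timeline still recorded as 1
current = wal_file_name(2, "0/2000028")  # timeline after the jump
print(stale)
print(current)
```

Because the timeline occupies the first eight hex digits of the name, a
control file that still records the old timeline yields the name of a
different file than the one the standby is actually replaying.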

In short, yes, what you are doing is definitely risky in my opinion,
particularly for complex cascading setups.
--
Michael
