Re: pg_stop_backup(wait_for_archive := true) on standby server

From: David Steele <david(at)pgmasters(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stop_backup(wait_for_archive := true) on standby server
Date: 2017-07-24 19:28:07
Message-ID: bdd9cf10-879b-7066-dbc7-1f3983627bf2@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/24/17 12:28 PM, Stephen Frost wrote:
>
> * David Steele (david(at)pgmasters(dot)net) wrote:
>
>> While this patch brings pg_stop_backup() in line with the
>> documentation, it also introduces a behavioral change compared to
>> 9.6. Currently, the default 9.6 behavior on a standby is to return
>> immediately from pg_stop_backup(), but this patch will cause it to
>> block by default instead. Since action on the master may be
>> required to unblock the process, I see this as a pretty significant
>> change. I still think we should fix and backpatch, but I wanted to
>> point this out.
>
> This will need to be highlighted in the release notes for 9.6.4 also,
> assuming there is agreement to back-patch this, and we'll need to update
> the documentation in 9.6 also to be clearer about what happens on a
> standby.

Agreed.

>> "If the WAL segments required for this backup have not been archived
>> then it might be necessary to force a segment switch on the
>> primary."
>
> I'm a bit on the fence regarding the distinction here for the
> backup-from-standby and this errhint(). The issue is that the WAL for
> the backup hasn't been archived yet and that may be because of either:
>
> archive_command is failing
> OR
> No WAL is getting generated
>
> In either case, both of these apply to the primary and the standby. As
> such, I'm inclined to try to mention both in the original errhint()
> instead of making two different ones. I'm open to suggestions on this,
> of course, but my thinking would be more like:
>
> -----
> Either archive_command is failing or not enough WAL has been generated
> to require a segment switch. Run pg_switch_wal() to request a WAL
> switch and monitor your logs to check that your archive_command is
> executing properly.
> -----

Yes, that seems more concise. I like the idea of not having to maintain
two separate hints.

> And then change pg_switch_wal()'s errhint for when it's run on a replica
> to be:
>
> ----
> HINT: WAL control functions cannont be executed during recovery; they
> should be executed on the primary instead.
> ----

Looks good to me. This explanation is useful in general.

>> 2) At backup.sgml L1015 it says that pg_stop_backup() will
>> automatically switch the WAL segment. There should be a caveat here
>> for standby backups, like:
>>
>> When called on a master it terminates the backup mode and performs
>> an automatic switch to the next WAL segment. The reason for the
>> switch is to arrange for the last WAL segment written during the
>> backup interval to be ready to archive. When called on a standby
>> the WAL segment switch must be performed manually on the master if
>> it does not happen due to normal write activity.
>
> s/master/primary/g

Yes.

>> 3) The fact that this fix requires "archive_mode = always" seems
>> like a pretty big caveat, thought I don't have any ideas on how to
>> make it better without big code changes. Maybe it would be help to
>> change:
>>
>> + the backup is taken on a standby, <function>pg_stop_backup</> waits
>> + for WAL to be archived when <varname>archive_mode</>
>>
>> To:
>>
>> + the backup is taken on a standby, <function>pg_stop_backup</> waits
>> + for WAL to be archived *only* when <varname>archive_mode</>
>
> I'm thinking of rewording that a bit to say "When archive_mode = always"
> instead, as I think that might be a little clearer.

Sure.

>> Perhaps this should be noted in the pg_basebackup docs as well?
>
> That seems like it's probably a good idea too, as people might wonder
> why pg_basebackup hasn't finished yet if it's waiting for WAL to be
> archived.

Yes, and this is another behavioral change to consider -- one that is
more likely to impact users than the change to pg_stop_backup(). If
pg_basebackup is not currently waiting for WAL on a standby (but does on
a primary) then that's pretty serious.

Any thoughts on this, Magnus?

--
-David
david(at)pgmasters(dot)net

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-07-24 19:43:15 Re: pg_stop_backup(wait_for_archive := true) on standby server
Previous Message Thomas Munro 2017-07-24 19:27:29 Re: Buildfarm failure and dubious coding in predicate.c