Re: pg_stop_backup(wait_for_archive := true) on standby server

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stop_backup(wait_for_archive := true) on standby server
Date: 2017-08-06 01:22:34
Message-ID: CA+TgmoagRwoDor0Mg32b=hN2JRGNbT=tM__2qw-_QM5YGwFJ6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Aug 5, 2017 at 4:28 PM, Paul A Jungwirth
<pj(at)illuminatedcomputing(dot)com> wrote:
> I don't have an opinion on the urgency of back-porting a fix, but if
> pg_stop_backup(boolean) allows for inconsistent backups, it does sound
> like a problem on 9.6 too.

It doesn't. The talk about inconsistent backups is, I think, not a
very precise way of speaking. All backups taken by copying the data
directory are inconsistent until you run recovery for long enough to
make them consistent. What we're talking about is whether it's
guaranteed that all WAL segments needed to reach consistency will be
archived by the time pg_stop_backup() returns. However, even if
they're not, it doesn't mean that there's anything wrong with your
backup per se; it just means you need to replay WAL files that were
not in the archive at the time pg_stop_backup() completed when
restoring it.

So, for example, if you keep an archive of all of your WAL files for
PITR purposes, you're fine. IIUC, to have a problem, you have to do
the following:

1. Take a backup from the standby, not the master.
2. Take a backup using pg_start_backup() and pg_stop_backup(), not
pg_basebackup.
3. Instead of keeping all of your WAL files, just keep the absolute
minimum number required to restore the backup.
4. Instead of keeping the WAL files through the LSN returned by
pg_stop_backup(), only keep the ones that were archived before
pg_stop_backup() returned.

All of (1)-(3) are legitimate user choices, although not everyone will
make them. (4) is unfortunately the procedure recommended by our
documentation, which is where the problem comes in. I think it's
pretty lame that the documentation recommends ignoring the return
value of pg_stop_backup(); ISTM that it would be very wise to
recommend cross-checking the return value against the WAL files you're
keeping for the backup. Even if we thought the waiting logic was
working perfectly in every case, having a cross-check on something as
important as backup integrity seems like an awfully good idea.

For example, *even if* we were to apply the patch Michael Paquier
proposed, it doesn't actually do anything except emit a warning when
archive_mode isn't set to always, and that warning could easily be
missed by an automated backup script, and archive_mode=always is
probably not a common setting. So you still have the same problem in
most cases. I think the root of this problem is that commit
7117685461af50f50c03f43e6a622284c8d54694 did only a pretty perfunctory
update of the documentation, as the commit message itself admits:

Only reference documentation is included. The main section on
backup still needs
to be rewritten to cover this, but since that is already scheduled
for a separate
large rewrite, it's not included in this patch.

But it doesn't look like that ever really got done.
https://www.postgresql.org/docs/10/static/continuous-archiving.html#backup-lowlevel-base-backup
really doesn't talk at all about standbys or differences in required
procedures between masters and standbys. It makes statements that are
flagrantly riduculous in the case of a standby, like claiming that
pg_start_backup() always performs a checkpoint and that pg_stop_backup
"terminates the backup mode and performs an automatic switch to the
next WAL segment." Well, obviously not.

And at least to me, that's the real bug here. Somebody needs to go
through and fix this documentation properly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-08-06 01:24:37 Re: modeling parallel contention (was: Parallel Append implementation)
Previous Message Peter Geoghegan 2017-08-05 23:28:59 Re: Crash report for some ICU-52 (debian8) COLLATE and work_mem values