Re: [BUG] pg_basebackup from disconnected standby fails

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [BUG] pg_basebackup from disconnected standby fails
Date: 2016-10-21 17:02:21
Message-ID: CA+TgmoaRdyfCqcg9x-E99XCxdC4LW0zygP5qLsPhFsb6BYDKow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>> So, if I understand correctly, then we can mark the version posted by
>> you upthread [1] which includes a test along with Kyotaro's fix can be
>> marked as Ready for committer. If so, then please change the status
>> of patch accordingly.
>
> Patch moved to next CF 2016-11, still with status "ready for committer".
>
> IMO, this thread has reached as conclusion to use this patch to
> address the problem reported:
> CAB7nPqTv5gmKQcNDoFGTGqoqXz2xLz4RRw247oqOJzZTVy6-7Q(at)mail(dot)gmail(dot)com

I spent some time reading this thread today and trying to understand
exactly what's going on here. Here's my attempt to restate the
problem:

1. If a restartpoint flushes no dirty buffers, then the redo location
advances but the minimum recovery point does not; therefore, the
minimum recovery point can fall behind the redo location. That's
correct behavior, because if the standby is subsequently restarted, it
can correctly begin at the checkpoint record associated with the
restartpoint and need not replay any further.

2. However, it causes a problem when a base backup with the "include
WAL" option is taken from a standby because -- on standby servers only
-- the end location for a backup as returned by do_pg_stop_backup() is
the minimum recovery point. If this precedes the start point for the
backup -- which is the restart point -- perform_base_backup() will
conclude that no WAL files are required for the backup and fail an
internal sanity check.

3. Removing the sanity check wouldn't help, because it can never be
right to end up with zero WAL files as a result of a base backup. At
a minimum, we need the WAL file which contains the checkpoint record
from which the control file specifies that redo should start.
Actually, I think that checkpoint record could be spread across
multiple files: it might be big. We need them all, but the current
code won't ensure that we get them.

The consensus solution on this thread seems to be that we should have
pg_do_stop_backup() return the last-replayed XLOG location as the
backup end point. If the control file has been updated with a newer
redo location, then the associated checkpoint record has surely been
replayed, so we'll definitely have enough WAL to replay that
checkpoint record (and we don't need to replay anything else, because
we're supposing that this is the case where the minimum recovery point
precedes the redo location). Although this will work, it might
include more WAL in the backup than is strictly necessary. If the
additional WAL replayed beyond the minimum recovery point does NOT
include a checkpoint, we don't need any of it; if it does, we need
only the portion up to and including the last checkpoint record, and
not anything beyond that.

I can think of two solutions that would be "tighter":

1. When performing a restartpoint, update the minimum recovery point
to just beyond the checkpoint record. I think this can't hurt anyone
who is actually restarting recovery on the same machine, because
that's exactly the point where they are going to start recovery. A
minimum recovery point which precedes the point at which they will
start recovery is no better than one which is equal to the point at
which they will start recovery.

2. In perform_base_backup(), if the endptr returned by
do_pg_stop_backup() precedes the end of the checkpoint record returned
by do_pg_start_backup(), use the latter value instead. Unfortunately,
that's not so easy: we can't just say if (endptr < starptr) startptr =
endptr; because startptr is the *start* of the checkpoint record, not
the end. I suspect somebody could figure out a solution to this
problem, though.

If we decide we don't want to aim for one of these tighter solutions
and just adopt the previously-discussed patch, then at the very least
it needs better comments. The minimum comment change the patch makes
right now doesn't give any explanation of why the new algorithm is
better than the old, and that's not cool. Without that, the next
person who comes along to maintain this code won't have any easy way
of knowing that we grappled with this problem and what our conclusions
were.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2016-10-21 17:17:08 Re: Default setting for autovacuum_freeze_max_age
Previous Message Alvaro Herrera 2016-10-21 16:55:50 Re: Default setting for autovacuum_freeze_max_age