[BUG] pg_basebackup from disconnected standby fails

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [BUG] pg_basebackup from disconnected standby fails
Date: 2016-06-09 12:55:58
Message-ID: 20160609.215558.118976703.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello, I found that pg_basebackup from a replication standby
fails after the following steps, on 9.3 and on the master branch.

- start a replication master
- start a replication standby
- stop the master in a mode other than immediate.

Running pg_basebackup against the standby then fails with the
following error.

> pg_basebackup: could not get transaction log end position from
> server: ERROR: could not find any WAL files

The immediate cause is that do_pg_stop_backup returns an earlier
LSN than do_pg_start_backup. The backup start point is the redo
point of the last executed restart point, while the backup end
point is the minRecoveryPoint at the time of the call.

A restart point doesn't update the minRecoveryPoint when it is
actually executed, even though ControlFile->checkPointCopy is
updated to the redo point of the restart point just made. In
this situation the minRecoveryPoint is too small to serve as the
backup end point. That is, the end point can fall behind the
start point.
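
The ordering problem described above can be illustrated with a
small stand-alone model. This is only a sketch under stated
assumptions, not PostgreSQL code: ModelLsn, ModelControlFile and
the model_* functions are hypothetical names that mirror the terms
used in this mail, and LSNs are treated as plain 64-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN: a plain byte position in the WAL. */
typedef uint64_t ModelLsn;

/* Hypothetical, simplified view of the two values discussed above. */
typedef struct ModelControlFile
{
    ModelLsn checkpoint_redo;    /* models the redo point in checkPointCopy */
    ModelLsn min_recovery_point; /* models minRecoveryPoint */
} ModelControlFile;

/*
 * Model of a restart point as described above: the redo point moves
 * forward, but min_recovery_point is left untouched.
 */
void
model_restartpoint(ModelControlFile *cf, ModelLsn new_redo)
{
    assert(new_redo >= cf->checkpoint_redo); /* redo point never goes back */
    cf->checkpoint_redo = new_redo;
}

/* Backup start point: redo point of the last executed restart point. */
ModelLsn
model_backup_start(const ModelControlFile *cf)
{
    return cf->checkpoint_redo;
}

/* Backup end point as described above: minRecoveryPoint at call time. */
ModelLsn
model_backup_end(const ModelControlFile *cf)
{
    return cf->min_recovery_point;
}

/* A backup range only makes sense if the end is not before the start. */
bool
model_backup_range_is_valid(const ModelControlFile *cf)
{
    return model_backup_end(cf) >= model_backup_start(cf);
}
```

In this model, starting from equal values and calling
model_restartpoint() with a larger redo point makes
model_backup_range_is_valid() return false, which corresponds to
the end point falling behind the start point.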

This can be caused by the simple steps above, but it can also
occur when pg_basebackup connects after the master's
disconnection during a restart point (with some other
timing-dependent conditions).

So the following comment in do_pg_stop_backup seems somewhat
wrong.

> * We return the current minimum recovery point as the backup end
> * location. Note that it can be greater than the exact backup end
> * location if the minimum recovery point is updated after the backup of
> * pg_control. This is harmless for current uses.

After looking more closely, I found that the minRecoveryPoint
tends to be too small as the backup end point: every record up to
lastReplayedRecPtr may already have modified pages on disk, and
those pages can end up in the backup just taken.

My conclusion here is that do_pg_stop_backup should return
lastReplayedRecPtr, not minRecoveryPoint.
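
The effect of that choice can be sketched with a toy model. It is
an illustration only and is not taken from the attached patch:
SketchLsn, SketchRecoveryState and the sketch_* functions are
hypothetical names, and the model assumes that a restart point can
only use a redo point that has already been replayed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN. */
typedef uint64_t SketchLsn;

/* Hypothetical, simplified recovery state on a standby. */
typedef struct SketchRecoveryState
{
    SketchLsn restartpoint_redo;  /* backup start point */
    SketchLsn min_recovery_point; /* models minRecoveryPoint */
    SketchLsn last_replayed;      /* models lastReplayedRecPtr */
} SketchRecoveryState;

/* Replaying WAL advances last_replayed only (in this model). */
void
sketch_replay_to(SketchRecoveryState *st, SketchLsn lsn)
{
    assert(lsn >= st->last_replayed);
    st->last_replayed = lsn;
}

/*
 * Assumption of this sketch: a restart point uses a redo point that
 * has already been replayed, and does not touch min_recovery_point.
 */
void
sketch_restartpoint(SketchRecoveryState *st, SketchLsn redo)
{
    assert(redo <= st->last_replayed);
    assert(redo >= st->restartpoint_redo);
    st->restartpoint_redo = redo;
}

/* End point as currently returned: minRecoveryPoint. */
SketchLsn
sketch_end_current(const SketchRecoveryState *st)
{
    return st->min_recovery_point;
}

/* End point as proposed in this mail: lastReplayedRecPtr. */
SketchLsn
sketch_end_proposed(const SketchRecoveryState *st)
{
    return st->last_replayed;
}
```

Under these assumptions sketch_end_proposed() can never be smaller
than restartpoint_redo, while sketch_end_current() can be.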

The attached small patch does this on the master branch. It
fixes the first problem for me.

Any thoughts?

# Sorry, but I'll be offline 'til Monday.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Make-pg_stop_backup-on-standby-give-proper-end-LSN.patch text/x-patch 2.4 KB
