Re: BUG #7500: hot-standby replica crash after an initial rsync

From: Stuart Bishop <stuart(at)stuartbishop(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync
Date: 2012-08-29 16:38:54
Message-ID: CADmi=6P6VT=6sW0XjW6cn35bW_uW=365TU8_Ssbd8Oepn-Cacw@mail.gmail.com
Lists: pgsql-bugs

On Wed, Aug 29, 2012 at 10:59 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On Wednesday, August 29, 2012 05:32:31 PM Stuart Bishop wrote:
>> I believe I just hit this same issue, but with PG 9.1.3:
>>
>> <@:32407> 2012-08-29 10:02:09 UTC LOG: shutting down
>> <@:32407> 2012-08-29 10:02:09 UTC LOG: database system is shut down
>> <[unknown](at)[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: connection
>> received: host=[local]
>> <[unknown](at)[unknown]:31687> 2012-08-29 13:34:03 UTC LOG: incomplete
>> startup packet
>> <@:31686> 2012-08-29 13:34:03 UTC LOG: database system was
>> interrupted; last known up at 2012-08-29 13:14:47 UTC
>> <@:31686> 2012-08-29 13:34:03 UTC LOG: entering standby mode
>> <@:31686> 2012-08-29 13:34:03 UTC LOG: redo starts at A92/5F000020
>> <@:31686> 2012-08-29 13:34:03 UTC FATAL: could not access status of
>> transaction 208177034
>> <@:31686> 2012-08-29 13:34:03 UTC DETAIL: Could not read from file
>> "pg_multixact/offsets/0C68" at offset 131072: Success.
>> <@:31686> 2012-08-29 13:34:03 UTC CONTEXT: xlog redo create multixact
>> 208177034 offset 1028958730: 1593544329 1593544330
>> <@:31681> 2012-08-29 13:34:03 UTC LOG: startup process (PID 31686)
>> exited with exit code 1
>> <@:31681> 2012-08-29 13:34:03 UTC LOG: terminating any other active
>> server processes
>>
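
For reference, the file name and offset in the FATAL message can be reproduced from the multixact ID alone. The sketch below assumes a default build, which the thread does not state: 8 kB pages, 4-byte entries in pg_multixact/offsets, and 32 pages per segment file. The function name is made up for illustration.

```shell
#!/bin/sh
# Assumed build defaults (not stated in the thread).
BLCKSZ=8192             # bytes per page
ENTRY_SIZE=4            # bytes per offsets entry
PAGES_PER_SEGMENT=32    # pages per segment file

# Hypothetical helper: map a multixact ID to the offsets segment file
# and the byte offset of the page that holds its entry.
multixact_location() {
    mxid=$1
    per_page=$((BLCKSZ / ENTRY_SIZE))
    page=$((mxid / per_page))
    segment=$((page / PAGES_PER_SEGMENT))
    byte_offset=$(( (page % PAGES_PER_SEGMENT) * BLCKSZ ))
    printf 'pg_multixact/offsets/%04X at offset %d\n' "$segment" "$byte_offset"
}

multixact_location 208177034
# prints: pg_multixact/offsets/0C68 at offset 131072
```

Under those assumptions the result matches the log, so redo was looking for a page that multixact 208177034 legitimately maps to. The trailing "Success" presumably means the read came back short rather than failing with an I/O error, which would fit a segment file that was copied while still shorter than the WAL expects.
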
>> This was attempting to rebuild a hot standby after switching my master
>> to a new server. In between the shutdown and the attempt to restart:
>>
>> - The master was put into backup mode.
>> - The datadir was rsynced over, using rsync -ahhP --delete-before
>> --exclude=postmaster.pid --exclude=pg_xlog
>> - The master was taken out of backup mode.
>> - The pg_xlog directory was emptied
>> - The pg_xlog directory was rsynced across from the master. This
>> included all the WAL files from before the promotion, throughout
>> backup mode, and a few from after backup mode was left.
> That's not valid, you cannot easily guarantee that you've not copied files that
> were in the process of being written to. Use a recovery_command if you do not
> want all files to be transferred via the replication connection. But do that
> only for files that have been archived via an archive_command beforehand.

Ok. I had assumed this was fine, as the docs explicitly tell me to
copy across any unarchived WAL files when doing failover. I think my
confusion is because the docs for building a standby refer to the
section on recovering from a backup, but I have a live server.

I'll just let the WAL files get sucked over the replication connection
if that works - this seems much simpler. I don't think I saw this
mentioned in the docs. I had been assuming enough WAL needed to be
available to bring the DB up to a consistent state before streaming
replication would start.
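
A rough sketch of that simpler procedure, for illustration only and not a verified fix for the crash above. The host name, role name, and the assumption that both servers use the same data directory path are hypothetical. It also assumes the master still holds every WAL segment written since the backup started (for example through a large enough wal_keep_segments), since nothing is copied into pg_xlog by hand.

```shell
#!/bin/sh
# Sketch only: host names, paths and the role name are hypothetical.

# Print a minimal recovery.conf for a 9.1 streaming standby.
# $1 = master host, $2 = replication role
write_recovery_conf() {
    cat <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$1 user=$2'
EOF
}

# Rebuild the standby data directory from a running master.
# Only defined here, not invoked: it needs a reachable master
# and a stopped standby.
# $1 = master host, $2 = data directory (assumed identical on both hosts)
rebuild_standby() {
    master=$1
    datadir=$2

    psql -h "$master" -U postgres \
        -c "SELECT pg_start_backup('standby_rebuild', true)" || return 1

    # Copy everything except the pid file and the contents of pg_xlog.
    rsync -a --delete --exclude=postmaster.pid --exclude='pg_xlog/*' \
        "$master:$datadir/" "$datadir/"
    rsync_status=$?

    # Always leave backup mode, even if rsync failed.
    psql -h "$master" -U postgres -c "SELECT pg_stop_backup()" || return 1

    # rsync exit code 24 means some source files vanished mid-copy,
    # which is expected on a live server.
    case $rsync_status in
        0|24) ;;
        *) return 1 ;;
    esac

    # Drop stale WAL and timeline history files left over from the old
    # standby; the needed WAL is fetched over the replication connection.
    rm -f "$datadir"/pg_xlog/[0-9A-F]*

    write_recovery_conf "$master" replicator > "$datadir/recovery.conf"
}
```

With the standby stopped, something like `rebuild_standby master.example.com /var/lib/postgresql/9.1/main` would then be followed by a normal server start. The pg_xlog contents are deliberately never copied, which avoids picking up segments that are still being written to.
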

> Did you have a backup label in the rsync'ed datadir? In Maxim's case I could
> detect that he had not via line numbers, but I do not see them here...

Yes, the backup_label was copied across (confirmed in scrollback from the rsync).

--
Stuart Bishop <stuart(at)stuartbishop(dot)net>
http://www.stuartbishop.net/
