Re: a new standby server promotes itself to primary without cause

From: Scott Mead <scottm(at)openscg(dot)com>
To: John Scalia <jayknowsunix(at)gmail(dot)com>
Cc: Keith Fiske <keith(at)omniti(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: a new standby server promotes itself to primary without cause
Date: 2015-10-08 15:04:00
Message-ID: CAKq0gvJHhgXP_mFq75rV7-ByxV9Xfv2nNomb_k+xbg7Mmz4YRA@mail.gmail.com
Lists: pgsql-admin

On Thu, Oct 8, 2015 at 10:59 AM, John Scalia <jayknowsunix(at)gmail(dot)com> wrote:

> Yep, they're both in different directories and on different ports. Allow
> me to note that I have both hand-built this file from the sample provided
> in /usr/pgsql-9.4/share and used pg_basebackup's -R option to automatically
> generate it. Neither seems to make a difference. Here's the recovery.conf
> file I'm using on the standby.
>
> restore_command = 'cp /var/lib/pgsql/9.4/share/%f %p'  # e.g. 'cp /mnt/server/archivedir/%f %p'
> archive_cleanup_command = '/usr/pgsql-9.4/bin/pg_archivecleanup /var/lib/pgsql/9.4/share %r'
> #recovery_end_command = ''
> #recovery_target_name = '' # e.g. 'daily backup 2011-01-26'
> #recovery_target_time = '' # e.g. '2004-07-14 22:39:00 EST'
> #recovery_target_xid = ''
> #recovery_target_inclusive = true
> recovery_target = 'immediate'
>

I believe you should omit recovery_target. From
http://www.postgresql.org/docs/9.4/static/recovery-target-settings.html:

recovery_target = 'immediate'

This parameter specifies that recovery should end as soon as a consistent
state is reached, i.e. as early as possible. When restoring from an online
backup, this means the point where taking the backup ended.
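
With that line removed, something along these lines is a minimal sketch for a
pure standby (host, port, slot, and paths copied from your file below; adjust
as needed), and it should stay in recovery until the trigger file appears:

standby_mode = on
primary_conninfo = 'host=10.32.10.175 port=5432 user=replicator'
primary_slot_name = 'replicant'
restore_command = 'cp /var/lib/pgsql/9.4/share/%f %p'
archive_cleanup_command = '/usr/pgsql-9.4/bin/pg_archivecleanup /var/lib/pgsql/9.4/share %r'
trigger_file = '/var/lib/pgsql/9.4/share/trigger_file'
# no recovery_target* settings -- those tell recovery where to stop, after
# which the server exits recovery and can promote onto a new timeline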

--
Scott Mead
Sr. Architect
OpenSCG
PostgreSQL, Java & Linux Experts

http://openscg.com

> pause_at_recovery_target = false
> standby_mode = on
> primary_conninfo = 'host=10.32.10.175 port=5432 user=replicator'
> # e.g. 'host=localhost port=5432'
> primary_slot_name = 'replicant'
> trigger_file = '/var/lib/pgsql/9.4/share/trigger_file'
> #recovery_min_apply_delay = 0
>
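
As a quick sanity check after restarting the standby, this should return true
for as long as it is still in recovery (and the 'replicator' connection should
show up in pg_stat_replication on the primary):

SELECT pg_is_in_recovery();
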
> I have also gone through the master's postgresql.conf file several times
> looking for problems, and I've verified the replication slot exists.
>
> Thanks for your assistance,
> Jay
>
> On Thu, Oct 8, 2015 at 9:51 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:
>
>>
>>
>>
>>
>> On Thu, Oct 8, 2015 at 9:24 AM, John Scalia <jayknowsunix(at)gmail(dot)com>
>> wrote:
>>
>>> Hi all,
>>>
>>> We have an ongoing discussion here about retiring Slony and implementing
>>> streaming synchronous replication. As part of this effort, I've been trying
>>> to build a small cluster with one primary and one standby. Yes, I know
>>> about the potential problems with a single standby, but this is only to
>>> provide configuration parameters to my sysadmins.
>>>
>>> What I've tried is building both instances on a single sandbox server.
>>> The primary has been running correctly for some time. The standby starts,
>>> but quickly turns itself into a primary without any trigger file being
>>> present. My process is:
>>>
>>> 1) build the new standby in a different directory with pg_basebackup
>>> 2) edit the standby's postgresql.conf so the standby has a different
>>> port.
>>> 3) start the standby.
>>> 4) immediately after starting the standby, I see a recovery.done file in
>>> that PGDATA dir.
>>>
>>> Here's part of the debug2-level log from that standby; I'm not seeing
>>> why this standby is becoming a primary, only that it is:
>>>
>>> [2015-10-08 07:18:19.448 CDT] @: LOG: 00000: database system is ready
>>> to accept read only connections
>>> [2015-10-08 07:18:19.448 CDT] @: LOCATION: sigusr1_handler,
>>> postmaster.c:4857
>>> cp: cannot stat `/var/lib/pgsql/9.4/share/00000002.history': No such
>>> file or directory
>>> [2015-10-08 07:18:19.477 CDT] @: DEBUG: 00000: could not restore file
>>> "00000002.history" from archive: child process exited with exit code 1
>>> [2015-10-08 07:18:19.477 CDT] @: LOCATION: RestoreArchivedFile,
>>> xlogarchive.c:304
>>> [2015-10-08 07:18:19.477 CDT] @: LOG: 00000: selected new timeline ID: 2
>>> [2015-10-08 07:18:19.477 CDT] @: LOCATION: StartupXLOG, xlog.c:7107
>>> cp: cannot stat `/var/lib/pgsql/9.4/share/00000001.history': No such
>>> file or directory
>>> [2015-10-08 07:18:19.485 CDT] @: DEBUG: 00000: could not restore file
>>> "00000001.history" from archive: child process exited with exit code 1
>>> [2015-10-08 07:18:19.485 CDT] @: LOCATION: RestoreArchivedFile,
>>> xlogarchive.c:304
>>> [2015-10-08 07:18:19.792 CDT] @: LOG: 00000: archive recovery complete
>>> [2015-10-08 07:18:19.792 CDT] @: LOCATION: exitArchiveRecovery,
>>> xlog.c:5417
>>> [2015-10-08 07:18:19.798 CDT] @: DEBUG: 00000: performing replication
>>> slot checkpoint
>>> [2015-10-08 07:18:19.798 CDT] @: LOCATION: CheckPointReplicationSlots,
>>> slot.c:794
>>> [2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: attempting to remove WAL
>>> segments older than log file 0000000000000006000000D8
>>> [2015-10-08 07:18:19.814 CDT] @: LOCATION: RemoveOldXlogFiles,
>>> xlog.c:3775
>>> [2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: SlruScanDirectory
>>> invoking callback on pg_multixact/offsets/0000
>>> [2015-10-08 07:18:19.814 CDT] @: LOCATION: SlruScanDirectory,
>>> slru.c:1307
>>> [2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory
>>> invoking callback on pg_multixact/members/0000
>>> [2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory,
>>> slru.c:1307
>>> [2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory
>>> invoking callback on pg_multixact/offsets/0000
>>> [2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory,
>>> slru.c:1307
>>> [2015-10-08 07:18:19.893 CDT] @: DEBUG: 00000: attempting to remove WAL
>>> segments newer than log file 0000000200000006000000E1
>>> [2015-10-08 07:18:19.893 CDT] @: LOCATION: RemoveNonParentXlogFiles,
>>> xlog.c:5458
>>> [2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: oldest MultiXactId
>>> member is at offset 1
>>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: SetOffsetVacuumLimit,
>>> multixact.c:2677
>>> [2015-10-08 07:18:19.899 CDT] @: LOG: 00000: MultiXact member
>>> wraparound protections are now enabled
>>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset,
>>> multixact.c:2587
>>> [2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: MultiXact member stop
>>> limit is now 4294914944 based on MultiXact 1
>>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset,
>>> multixact.c:2590
>>> [2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: release all standby locks
>>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: StandbyReleaseAllLocks,
>>> standby.c:666
>>> [2015-10-08 07:18:19.903 CDT] @: LOG: 00000: starting background worker
>>> process "powa"
>>> [2015-10-08 07:18:19.904 CDT] @: LOCATION: do_start_bgworker,
>>> postmaster.c:5412
>>> [2015-10-08 07:18:19.905 CDT] @: LOG: 00000: database system is ready
>>> to accept connections
>>>
>>> The one thing that's a little weird on this configuration is that both
>>> the primary and the standby are on the same system, and I've never done a
>>> configuration this way before. Usually, these are on different systems, but
>>> this was done just to test some configuration parameters. Can a standby and
>>> a primary not exist together on the same system? In any event, there is no
>>> trigger file as specified in the recovery.conf on this system. So, why is
>>> promotion occurring?
>>> --
>>> Jay
>>>
>>
>>
>> As long as they're on different ports and using different data directories,
>> you should be fine.
>> Can you share the contents of your recovery.conf file?
>>
>> --
>> Keith Fiske
>> Database Administrator
>> OmniTI Computer Consulting, Inc.
>> http://www.keithf4.com
>>
>
>
