Re: a new standby server promotes itself to primary without cause

From: John Scalia <jayknowsunix(at)gmail(dot)com>
To: Keith Fiske <keith(at)omniti(dot)com>
Cc: "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: a new standby server promotes itself to primary without cause
Date: 2015-10-08 14:59:23
Message-ID: CABzCKRA0iCFzn6Og56jwbmF4P_44J7r0tEndMJPzt9xU74oC+A@mail.gmail.com
Lists: pgsql-admin

Yep, they're both in different directories and on different ports. Allow
me to note that I have both hand-built this file from the sample provided
in /usr/pgsql-9.4/share and generated it automatically with pg_basebackup's
-R option; neither seems to make a difference. Here's the recovery.conf
file I'm using on the standby.

restore_command = 'cp /var/lib/pgsql/9.4/share/%f %p'   # e.g. 'cp /mnt/server/archivedir/%f %p'
archive_cleanup_command = '/usr/pgsql-9.4/bin/pg_archivecleanup /var/lib/pgsql/9.4/share %r'
#recovery_end_command = ''
#recovery_target_name = ''   # e.g. 'daily backup 2011-01-26'
#recovery_target_time = ''   # e.g. '2004-07-14 22:39:00 EST'
#recovery_target_xid = ''
#recovery_target_inclusive = true
recovery_target = 'immediate'
pause_at_recovery_target = false
standby_mode = on
primary_conninfo = 'host=10.32.10.175 port=5432 user=replicator'   # e.g. 'host=localhost port=5432'
primary_slot_name = 'replicant'
trigger_file = '/var/lib/pgsql/9.4/share/trigger_file'
#recovery_min_apply_delay = 0
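For context, the pg_basebackup invocation was along these lines (a sketch
only; the standby data directory path here is an assumption, while host,
port, and user match the conninfo above):

```shell
# Sketch only: /var/lib/pgsql/9.4/standby is an assumed directory name.
# -R makes pg_basebackup write a recovery.conf with standby_mode = 'on'
# and primary_conninfo filled in from the connection options.
pg_basebackup -h 10.32.10.175 -p 5432 -U replicator \
    -D /var/lib/pgsql/9.4/standby -X stream -R

# Then start the standby (its postgresql.conf sets a different port):
/usr/pgsql-9.4/bin/pg_ctl -D /var/lib/pgsql/9.4/standby -w start
```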

I have also gone through the master's postgresql.conf file several times
looking for problems, and I've verified the replication slot exists.
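For completeness, the slot check on the primary was roughly this (a
sketch; exact psql connection options may differ):

```shell
# Sketch: ask the primary whether the slot the standby should attach
# to exists and whether anything is currently streaming from it.
psql -p 5432 -U postgres -x -c \
    "SELECT slot_name, slot_type, active, restart_lsn
       FROM pg_replication_slots
      WHERE slot_name = 'replicant';"
```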

Thanks for your assistance,
Jay

On Thu, Oct 8, 2015 at 9:51 AM, Keith Fiske <keith(at)omniti(dot)com> wrote:

>
> On Thu, Oct 8, 2015 at 9:24 AM, John Scalia <jayknowsunix(at)gmail(dot)com>
> wrote:
>
>> Hi all,
>>
>> We have an ongoing discussion here about retiring slony and implementing
>> streaming synchronous replication. As part of this effort, I've been trying
>> to build a small cluster with one primary and one standby. Yes, I know
>> about the potential problems with a single standby, but this is only to
>> provide configuration parameters to my sysadmins.
>>
>> What I've tried is building the two servers on a single sandbox server.
>> The primary has been running correctly for some time. The standby starts,
>> but quickly turns itself into a primary without any trigger file being
>> present. My process is:
>>
>> 1) build the new standby in a different directory with pg_basebackup
>> 2) edit the standby's postgresql.conf so the standby has a different port.
>> 3) start the standby.
>> 4) immediately after starting the standby, I see a recovery.done file in
>> that PGDATA dir.
>>
>> Here's part of the debug2-level log file from that standby; I'm not
>> seeing why this standby is becoming a primary, only that it is:
>>
>> [2015-10-08 07:18:19.448 CDT] @: LOG: 00000: database system is ready to
>> accept read only connections
>> [2015-10-08 07:18:19.448 CDT] @: LOCATION: sigusr1_handler,
>> postmaster.c:4857
>> cp: cannot stat `/var/lib/pgsql/9.4/share/00000002.history': No such file
>> or directory
>> [2015-10-08 07:18:19.477 CDT] @: DEBUG: 00000: could not restore file
>> "00000002.history" from archive: child process exited with exit code 1
>> [2015-10-08 07:18:19.477 CDT] @: LOCATION: RestoreArchivedFile,
>> xlogarchive.c:304
>> [2015-10-08 07:18:19.477 CDT] @: LOG: 00000: selected new timeline ID: 2
>> [2015-10-08 07:18:19.477 CDT] @: LOCATION: StartupXLOG, xlog.c:7107
>> cp: cannot stat `/var/lib/pgsql/9.4/share/00000001.history': No such file
>> or directory
>> [2015-10-08 07:18:19.485 CDT] @: DEBUG: 00000: could not restore file
>> "00000001.history" from archive: child process exited with exit code 1
>> [2015-10-08 07:18:19.485 CDT] @: LOCATION: RestoreArchivedFile,
>> xlogarchive.c:304
>> [2015-10-08 07:18:19.792 CDT] @: LOG: 00000: archive recovery complete
>> [2015-10-08 07:18:19.792 CDT] @: LOCATION: exitArchiveRecovery,
>> xlog.c:5417
>> [2015-10-08 07:18:19.798 CDT] @: DEBUG: 00000: performing replication
>> slot checkpoint
>> [2015-10-08 07:18:19.798 CDT] @: LOCATION: CheckPointReplicationSlots,
>> slot.c:794
>> [2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: attempting to remove WAL
>> segments older than log file 0000000000000006000000D8
>> [2015-10-08 07:18:19.814 CDT] @: LOCATION: RemoveOldXlogFiles,
>> xlog.c:3775
>> [2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: SlruScanDirectory
>> invoking callback on pg_multixact/offsets/0000
>> [2015-10-08 07:18:19.814 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
>> [2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory
>> invoking callback on pg_multixact/members/0000
>> [2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
>> [2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory
>> invoking callback on pg_multixact/offsets/0000
>> [2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
>> [2015-10-08 07:18:19.893 CDT] @: DEBUG: 00000: attempting to remove WAL
>> segments newer than log file 0000000200000006000000E1
>> [2015-10-08 07:18:19.893 CDT] @: LOCATION: RemoveNonParentXlogFiles,
>> xlog.c:5458
>> [2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: oldest MultiXactId member
>> is at offset 1
>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: SetOffsetVacuumLimit,
>> multixact.c:2677
>> [2015-10-08 07:18:19.899 CDT] @: LOG: 00000: MultiXact member wraparound
>> protections are now enabled
>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset,
>> multixact.c:2587
>> [2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: MultiXact member stop
>> limit is now 4294914944 based on MultiXact 1
>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset,
>> multixact.c:2590
>> [2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: release all standby locks
>> [2015-10-08 07:18:19.899 CDT] @: LOCATION: StandbyReleaseAllLocks,
>> standby.c:666
>> [2015-10-08 07:18:19.903 CDT] @: LOG: 00000: starting background worker
>> process "powa"
>> [2015-10-08 07:18:19.904 CDT] @: LOCATION: do_start_bgworker,
>> postmaster.c:5412
>> [2015-10-08 07:18:19.905 CDT] @: LOG: 00000: database system is ready to
>> accept connections
>>
>> The one thing that's a little weird on this configuration is that both
>> the primary and the standby are on the same system, and I've never done a
>> configuration this way before. Usually, these are on different systems, but
>> this was done just to test some configuration parameters. Can a standby and
>> a primary not exist together on the same system? In any event, there is no
>> trigger file as specified in the recovery.conf on this system. So, why is
>> promotion occurring?
>> --
>> Jay
>>
>
>
> As long as they're on different ports and using different data
> directories, you should be fine.
> Can you share the contents of your recovery.conf file?
>
> --
> Keith Fiske
> Database Administrator
> OmniTI Computer Consulting, Inc.
> http://www.keithf4.com
>
