a new standby server promotes itself to primary without cause

From:	John Scalia <jayknowsunix(at)gmail(dot)com>
To:	"pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject:	a new standby server promotes itself to primary without cause
Date:	2015-10-08 13:24:49
Message-ID:	CABzCKRBZguVgWw0hL7QjkEnOabVhEbaV7WDBhkO+-MCD=aNsnQ@mail.gmail.com
Lists:	pgsql-admin

Hi all,

We have an ongoing discussion here about retiring Slony and implementing
streaming synchronous replication. As part of this effort, I've been trying
to build a small cluster with one primary and one standby. Yes, I know
about the potential problems with a single standby, but this is only to
provide configuration parameters to my sysadmins.

What I've tried is building the two servers on a single sandbox server. The
primary has been running correctly for some time. The standby starts, but
quickly turns itself into a primary without any trigger file being present.
My process is:

1) Build the new standby in a different directory with pg_basebackup.
2) Edit the standby's postgresql.conf so the standby has a different port.
3) Start the standby.
4) Immediately after starting the standby, I see a recovery.done file in
that PGDATA dir.
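
The state described in step 4 can be checked with a short script. This is only a minimal sketch, assuming Python 3 is available: `read_recovery_settings` and `describe_standby` are hypothetical helper names, not part of PostgreSQL, and the parser handles only simple `name = 'value'` lines. It reports whether recovery.conf or recovery.done is present in a data directory and whether the configured trigger_file actually exists.

```python
import re
from pathlib import Path

# Matches simple "name = 'value'" or "name = value" lines, with an optional
# trailing comment. Values containing quotes or '#' are not handled.
_SETTING = re.compile(r"^\s*([A-Za-z_]+)\s*=\s*'?([^'#]*?)'?\s*(?:#.*)?$")


def read_recovery_settings(path):
    """Return the settings of a recovery.conf-style file as a dict."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and full-line comments
        match = _SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


def describe_standby(pgdata):
    """Summarise the recovery-related files in a data directory (hypothetical helper)."""
    pgdata = Path(pgdata)
    conf = pgdata / "recovery.conf"
    done = pgdata / "recovery.done"
    # Prefer recovery.conf; fall back to recovery.done if recovery already ended.
    source = conf if conf.exists() else done if done.exists() else None
    settings = read_recovery_settings(source) if source else {}
    trigger = settings.get("trigger_file")
    return {
        "recovery_conf": conf.exists(),
        "recovery_done": done.exists(),
        "standby_mode": settings.get("standby_mode"),
        "trigger_file": trigger,
        "trigger_present": bool(trigger) and Path(trigger).exists(),
    }
```

Calling `describe_standby` on the standby's PGDATA path right after pg_basebackup, and again after startup, shows which settings the server read and whether recovery.conf has been renamed to recovery.done.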

Here's part of the debug2-level log file from that standby. I'm not seeing
why this standby is becoming a primary, only that it is:

[2015-10-08 07:18:19.448 CDT] @: LOG: 00000: database system is ready to
accept read only connections
[2015-10-08 07:18:19.448 CDT] @: LOCATION: sigusr1_handler,
postmaster.c:4857
cp: cannot stat `/var/lib/pgsql/9.4/share/00000002.history': No such file
or directory
[2015-10-08 07:18:19.477 CDT] @: DEBUG: 00000: could not restore file
"00000002.history" from archive: child process exited with exit code 1
[2015-10-08 07:18:19.477 CDT] @: LOCATION: RestoreArchivedFile,
xlogarchive.c:304
[2015-10-08 07:18:19.477 CDT] @: LOG: 00000: selected new timeline ID: 2
[2015-10-08 07:18:19.477 CDT] @: LOCATION: StartupXLOG, xlog.c:7107
cp: cannot stat `/var/lib/pgsql/9.4/share/00000001.history': No such file
or directory
[2015-10-08 07:18:19.485 CDT] @: DEBUG: 00000: could not restore file
"00000001.history" from archive: child process exited with exit code 1
[2015-10-08 07:18:19.485 CDT] @: LOCATION: RestoreArchivedFile,
xlogarchive.c:304
[2015-10-08 07:18:19.792 CDT] @: LOG: 00000: archive recovery complete
[2015-10-08 07:18:19.792 CDT] @: LOCATION: exitArchiveRecovery, xlog.c:5417
[2015-10-08 07:18:19.798 CDT] @: DEBUG: 00000: performing replication slot
checkpoint
[2015-10-08 07:18:19.798 CDT] @: LOCATION: CheckPointReplicationSlots,
slot.c:794
[2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: attempting to remove WAL
segments older than log file 0000000000000006000000D8
[2015-10-08 07:18:19.814 CDT] @: LOCATION: RemoveOldXlogFiles, xlog.c:3775
[2015-10-08 07:18:19.814 CDT] @: DEBUG: 00000: SlruScanDirectory invoking
callback on pg_multixact/offsets/0000
[2015-10-08 07:18:19.814 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
[2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory invoking
callback on pg_multixact/members/0000
[2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
[2015-10-08 07:18:19.815 CDT] @: DEBUG: 00000: SlruScanDirectory invoking
callback on pg_multixact/offsets/0000
[2015-10-08 07:18:19.815 CDT] @: LOCATION: SlruScanDirectory, slru.c:1307
[2015-10-08 07:18:19.893 CDT] @: DEBUG: 00000: attempting to remove WAL
segments newer than log file 0000000200000006000000E1
[2015-10-08 07:18:19.893 CDT] @: LOCATION: RemoveNonParentXlogFiles,
xlog.c:5458
[2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: oldest MultiXactId member
is at offset 1
[2015-10-08 07:18:19.899 CDT] @: LOCATION: SetOffsetVacuumLimit,
multixact.c:2677
[2015-10-08 07:18:19.899 CDT] @: LOG: 00000: MultiXact member wraparound
protections are now enabled
[2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset,
multixact.c:2587
[2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: MultiXact member stop limit
is now 4294914944 based on MultiXact 1
[2015-10-08 07:18:19.899 CDT] @: LOCATION: DetermineSafeOldestOffset,
multixact.c:2590
[2015-10-08 07:18:19.899 CDT] @: DEBUG: 00000: release all standby locks
[2015-10-08 07:18:19.899 CDT] @: LOCATION: StandbyReleaseAllLocks,
standby.c:666
[2015-10-08 07:18:19.903 CDT] @: LOG: 00000: starting background worker
process "powa"
[2015-10-08 07:18:19.904 CDT] @: LOCATION: do_start_bgworker,
postmaster.c:5412
[2015-10-08 07:18:19.905 CDT] @: LOG: 00000: database system is ready to
accept connections
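
At debug2 level the LOG messages are easy to lose among the DEBUG and LOCATION lines. The following is a minimal sketch for pulling them out, assuming Python 3 and the log line prefix seen in the excerpt above (`[timestamp] user@db: LEVEL: SQLSTATE: message`). `log_entries` and `messages_at_level` are hypothetical helper names. Lines that do not start with `[` are treated as continuations of the previous entry, so stray stderr output such as the `cp:` errors is folded into whatever entry precedes it.

```python
import re

# "[timestamp] user@db: LEVEL: rest-of-message"
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\S*:\s+(?P<level>[A-Z0-9]+):\s+(?P<rest>.*)$"
)


def log_entries(text):
    """Re-join wrapped lines so that each log entry is a single string."""
    entries = []
    for line in text.splitlines():
        if line.startswith("["):
            entries.append(line.strip())          # a new entry starts here
        elif entries and line.strip():
            entries[-1] += " " + line.strip()     # continuation of the last entry
    return entries


def messages_at_level(text, level="LOG"):
    """Return (timestamp, message) pairs for entries at the given level."""
    result = []
    for entry in log_entries(text):
        match = ENTRY.match(entry)
        if match and match.group("level") == level:
            # Drop the leading five-character SQLSTATE, e.g. "00000: ".
            message = re.sub(r"^[0-9A-Z]{5}:\s*", "", match.group("rest"))
            result.append((match.group("ts"), message))
    return result
```

Passing the text of the log file to `messages_at_level` leaves only the LOG-level messages with their timestamps, in the order they were written.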

The one thing that's a little unusual about this configuration is that the
primary and the standby are both on the same system, which I've never done
before. Usually they are on different systems, but this was done just to
test some configuration parameters. Can a standby and a primary not coexist
on the same system? In any event, the trigger file specified in
recovery.conf does not exist on this system. So why is promotion occurring?
--
Jay

Responses

Re: a new standby server promotes itself to primary without cause at 2015-10-08 14:51:45 from Keith Fiske
Re: a new standby server promotes itself to primary without cause at 2015-10-08 14:54:36 from Scott Mead
