Streaming Replication Failover

From: ning chan <ninchan8328(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Streaming Replication Failover
Date: 2013-01-17 05:17:30
Message-ID: CAG0k5vDu=qkKBWWa=jiSDxhXk6jww3-vPKHLQYq=aTzq9NcF8w@mail.gmail.com
Lists: pgsql-general

Hi,
I have a cluster of 3 nodes: Standby A streams from the Primary, and
Standby B streams from Standby A.
I failed over the cluster:
1) stopped the Primary
2) promoted Standby A

Now I see from the syslog on Standby B that it is complaining about a
timeline mismatch.

Replication Status from Primary
=============================================
|Parameters | Value |
=============================================
|backend_start | 2013-01-16 23:05:48 |
|pid | 17851 |
|usesysid | 10 |
|usename | postgres |
|application_name | StandbyA |
|client_addr | 10.89.94.31 |
|client_hostname | |
|client_port | 43558 |
|state | streaming |
|sent_location | 0/1EAC3E68 |
|write_location | 0/1EAC3E68 |
|flush_location | 0/1EAC3E68 |
|replay_location | 0/1EAC3E68 |
|sync_priority | 0 |
|sync_state | async |
=============================================

Replication Status from Standby A
=============================================
|Parameters | Value |
=============================================
|backend_start | 2013-01-16 23:06:56 |
|pid | 12320 |
|usesysid | 10 |
|usename | postgres |
|application_name | StandByB |
|client_addr | 10.89.94.29 |
|client_hostname | |
|client_port | 48214 |
|state | streaming |
|sent_location | 0/1EAC3E68 |
|write_location | 0/1EAC3E68 |
|flush_location | 0/1EAC3E68 |
|replay_location | 0/1EAC3E68 |
|sync_priority | 0 |
|sync_state | async |
=============================================

Now I fail over the Primary.
On the Standby A syslog:
Jan 16 23:08:12 se032c-94-31 postgres[12316]: [3-1] 12316FATAL:
replication terminated by primary server
Jan 16 23:08:12 se032c-94-31 postgres[12312]: [5-1] 12312LOG: redo starts
at 0/1EAC3E68

On the Standby B syslog:
Jan 16 23:09:48 localhost postgres[3932]: [5-1] LOG: redo starts at
0/1EAC3E68

As soon as I promoted Standby A, replication between A & B broke. The
Standby B syslog shows the following:
Jan 16 23:11:28 localhost postgres[3945]: [2-1] FATAL: timeline 15 of the
primary does not match recovery target timeline 14
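
For reference, the two timeline IDs in that message can be pulled out of
the syslog with a small filter. This is only a sketch: the function name
timeline_mismatch is made up for illustration, and the pattern assumes the
message text is exactly as logged above, on a single line.

```shell
# Hypothetical helper (not part of PostgreSQL): read log lines on stdin and
# print "<primary timeline> <recovery target timeline>" for each mismatch line.
timeline_mismatch() {
  sed -n 's/.*FATAL: *timeline \([0-9][0-9]*\) of the primary does not match recovery target timeline \([0-9][0-9]*\).*/\1 \2/p'
}

# The Standby B log line from above, joined back onto one line.
line='Jan 16 23:11:28 localhost postgres[3945]: [2-1] FATAL: timeline 15 of the primary does not match recovery target timeline 14'

printf '%s\n' "$line" | timeline_mismatch   # prints: 15 14
```

Here the first number is the timeline the promoted Standby A is now on, and
the second is the timeline Standby B is still trying to follow.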

Now my question is: while A & B are in sync, why does promoting A break
the replication?

To resolve the problem, I need to stop the engine on B, rsync from A,
and start the engine on B again:
rsync -a --progress --exclude postgresql.conf --exclude recovery.done \
    --exclude pg_hba.conf root@10.89.94.31:/opt/postgres/9.2/data/* \
    /opt/postgres/9.2/data

Do I need to sync the whole data directory from A? I have a small DB now (2
tables with only a few rows), but this may take a long time with a much
larger DB. Is there any shortcut? Why do I need to do the rsync when A & B
were originally in sync?

Thanks~
Ning
