recovery_target_timeline and multiple slave behavior when master fails

From: Rick Pufky <rick(at)omniti(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: recovery_target_timeline and multiple slave behavior when master fails
Date: 2011-12-15 18:59:10
Message-ID: CAAaAz-JCe86N+5hGtpDvsSrNM4Od6C1t4EzQvAi64oOT3UD9gw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I'm attempting to make use of the ability defined by the statement at the
end of the first paragraph in this section of the documentation (
http://www.postgresql.org/docs/9.1/static/warm-standby.html#STANDBY-SERVER-SETUP).
Specifically, "If you plan to have multiple standby servers for high
availability purposes, set recovery_target_timeline to latest, to make the
standby server follow the timeline change that occurs at failover to
another standby."

To attempt this, I've setup 3 pg machines, a master (192.168.56.5) and 2
pitr slaves (192.168.56.6, 192.168.56.7), both connected to the master.
Below, the hostnames will remain the same, even if a slave is promoted to
master, so slave 1 will always refer to 192.168.56.6. I've
compiled/installed pg 9.1.2 and pgbench for testing purposes. I've made
changes to the postgresql.conf file as follows (for ease of testing, using
same postgresql.conf on all servers):
wal_level = hot_standby
archive_mode = off
max_wal_senders = 3
wal_keep_segments = 100
hot_standby = on
hot_standby_feedback = on

And on the slaves, I've got this for recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=192.168.56.5 port=5432 user=postgres'
trigger_file = '/pgsql/omnipitr/finish.recovery'
recovery_target_timeline = latest

I've initialized a fresh DB on master, then executed the following to
create fresh slave copies (with .7 instead of .6 to create slave 2):
pg_start_backup('start');
rsync -avh --exclude=pg_log --exclude=postgresql.conf
--exclude=postmaster.pid . 192.168.56.6:/pgsql/91/data
pg_stop_backup();
rsync -avh --exclude=pg_log --exclude=postgresql.conf
--exclude=postmaster.pid . 192.168.56.6:/pgsql/91/data

Both slaves were brought online and to prove that streaming replication was
working, I created a table and threw some simple data into it. These showed
up on the slaves, so setup appears good so far. For further data on the
master, I used pgbench (time /opt/pgsql/bin/pgbench -i -s 35 test). In the
middle of the pgbench data load (I let it load more than 1 million tuples
first), I killed the postgresql/pgbench processes on the master.
root(at)pg912-master:~# ps -ef |grep post
postgres 1955 1 0 11:39 pts/0 00:00:00 /opt/pgsql/bin/postmaster
-D /pgsql/91/data
postgres 1956 1955 0 11:39 ? 00:00:00 postgres: logger
process
postgres 1958 1955 0 11:39 ? 00:00:00 postgres: writer
process
postgres 1959 1955 0 11:39 ? 00:00:00 postgres: wal writer
process
postgres 1960 1955 0 11:39 ? 00:00:00 postgres: autovacuum
launcher process
postgres 1961 1955 0 11:39 ? 00:00:00 postgres: stats collector
process
postgres 2013 1955 1 11:50 ? 00:00:11 postgres: wal sender
process postgres 192.168.56.6(35184) streaming 0/2B000000
postgres 2015 1955 0 11:50 ? 00:00:02 postgres: wal sender
process postgres 192.168.56.7(44052) streaming 0/2B000000
postgres 2049 1709 1 12:05 pts/1 00:00:00 /opt/pgsql/bin/pgbench -i
-s 35 test
postgres 2050 1955 12 12:05 ? 00:00:00 postgres: postgres test
[local] COPY
root 2052 1602 0 12:05 pts/0 00:00:00 grep post
root(at)pg912-master:~# kill -9 1955 1956 1958 1959 1960 1961 2013 2015 2049
2050

After the master terminated, I promoted slave 1 to become the new master by
touching the trigger file (touch /pgsql/omnipitr/finish.recovery). Some
relevant snippets from slave 1's log file:
2011-12-15 12:05:56.191 EST 2580:LOG: recovery restart point at
0/2D0BE0B0
2011-12-15 12:05:56.191 EST 2580:DETAIL: last completed transaction was
at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:05:57.031 EST 2580:LOG: restartpoint starting: xlog
2011-12-15 12:06:03.067 EST 2580:LOG: restartpoint complete: wrote 98
buffers (3.2%); 0 transaction log file(s) added, 0 removed, 3 recycled;
write=4.907 s, sync=1.101 s, total=6.035 s; sync files=2, longest=1.070 s,
average=0.550 s
2011-12-15 12:06:03.067 EST 2580:LOG: recovery restart point at
0/300BE0B0
2011-12-15 12:06:03.067 EST 2580:DETAIL: last completed transaction was
at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:06:05.148 EST 2581:FATAL: could not receive data from WAL
stream:
2011-12-15 12:06:09.769 EST 2579:LOG: invalid record length at
0/33C92290
2011-12-15 12:06:09.938 EST 2608:FATAL: could not connect to the
primary server: could not connect to server: Connection refused
Is the server running on host "192.168.56.5" and accepting
TCP/IP connections on port 5432?
... more of the same primary server connection refused error ...
2011-12-15 12:08:09.892 EST 2579:LOG: trigger file found:
/pgsql/omnipitr/finish.recovery
2011-12-15 12:08:09.892 EST 2579:LOG: redo done at 0/33C921F0
2011-12-15 12:08:09.892 EST 2579:LOG: last completed transaction was at
log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:08:09.892 EST 2579:LOG: selected new timeline ID: 2
2011-12-15 12:08:10.078 EST 2579:LOG: archive recovery complete
2011-12-15 12:08:10.491 EST 2580:LOG: restartpoint complete: wrote 2692
buffers (87.6%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=112.051 s, sync=0.274 s, total=112.361 s; sync files=2, longest=0.273
s, average=0.137 s
2011-12-15 12:08:10.492 EST 2580:LOG: recovery restart point at
0/330BE0B0
2011-12-15 12:08:10.492 EST 2580:DETAIL: last completed transaction was
at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:08:10.493 EST 2580:LOG: checkpoint starting:
end-of-recovery immediate wait
2011-12-15 12:08:10.497 EST 2580:LOG: checkpoint complete: wrote 0
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=0.000 s, sync=0.000 s, total=0.005 s; sync files=0, longest=0.000 s,
average=0.000 s
2011-12-15 12:08:10.951 EST 2651:LOG: autovacuum launcher started
2011-12-15 12:08:10.960 EST 2577:LOG: database system is ready to
accept connections

Slave 2 was still attempting to connect to the master, so I modified its
recovery.conf file to look like this (connect to slave 1 instead of master)
and restarted pg on slave 2:
standby_mode = 'on'
primary_conninfo = 'host=192.168.56.6 port=5432 user=postgres'
trigger_file = '/pgsql/omnipitr/finish.recovery'
recovery_target_timeline = latest

Relevant log snippets from slave 2:
2011-12-15 12:05:54.992 EST 2626:LOG: recovery restart point at
0/2D0BE0B0
2011-12-15 12:05:54.992 EST 2626:DETAIL: last completed transaction was
at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:05:56.295 EST 2626:LOG: restartpoint starting: xlog
2011-12-15 12:06:05.979 EST 2627:FATAL: could not receive data from WAL
stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

2011-12-15 12:06:06.191 EST 2625:LOG: unexpected pageaddr 0/28E40000 in
log file 0, segment 50, offset 14942208
2011-12-15 12:06:06.218 EST 2626:LOG: restartpoint complete: wrote 185
buffers (6.0%); 0 transaction log file(s) added, 0 removed, 3 recycled;
write=9.692 s, sync=0.222 s, total=9.923 s; sync files=2, longest=0.204 s,
average=0.111 s
2011-12-15 12:06:06.218 EST 2626:LOG: recovery restart point at
0/300BE0B0
2011-12-15 12:06:06.218 EST 2626:DETAIL: last completed transaction was
at log time 2011-12-15 12:05:28.868118-05
2011-12-15 12:06:21.419 EST 2652:FATAL: could not connect to the
primary server: could not connect to server: Connection refused
Is the server running on host "192.168.56.5" and accepting
TCP/IP connections on port 5432?
... more of the same primary server connection refused error ...
2011-12-15 12:08:44.924 EST 2623:LOG: received fast shutdown request
2011-12-15 12:08:44.935 EST 2623:LOG: aborting any active transactions
2011-12-15 12:08:44.935 EST 2626:LOG: shutting down
2011-12-15 12:08:44.945 EST 2626:LOG: database system is shut down
2011-12-15 12:08:46.802 EST 2737:LOG: database system was shut down in
recovery at 2011-12-15 12:08:44 EST
2011-12-15 12:08:46.804 EST 2737:LOG: entering standby mode
2011-12-15 12:08:46.818 EST 2737:LOG: redo starts at 0/300BE0B0
2011-12-15 12:08:47.313 EST 2737:LOG: consistent recovery state reached
at 0/32E3FFF0
2011-12-15 12:08:47.313 EST 2735:LOG: database system is ready to
accept read only connections
2011-12-15 12:08:47.313 EST 2737:LOG: unexpected pageaddr 0/28E40000 in
log file 0, segment 50, offset 14942208
2011-12-15 12:08:47.437 EST 2740:FATAL: timeline 2 of the primary does
not match recovery target timeline 1
2011-12-15 12:08:52.323 EST 2742:FATAL: timeline 2 of the primary does
not match recovery target timeline 1
... repeated continuously ...

Any thoughts on the above snippets? Am I interpreting the documentation
correctly? Is there any further information needed to debug this?

Thanks,
Rick

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Scott Marlowe 2011-12-15 19:01:26 Re: LOCK DATABASE
Previous Message Raghavendra 2011-12-15 18:53:51 Re: null tablespace?