Re: Switching timeline over streaming replication

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'PostgreSQL-development'" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Switching timeline over streaming replication
Date: 2012-10-04 13:52:07
Message-ID: 001701cda237$723d8db0$56b8a910$@kapila@huawei.com
Lists: pgsql-hackers

> On Wednesday, October 03, 2012 8:45 PM Heikki Linnakangas wrote:
> On Tuesday, October 02, 2012 4:21 PM Heikki Linnakangas wrote:
> > Thanks for the thorough review! I committed the xlog.c refactoring patch
> > now. Attached is a new version of the main patch, comments on specific
> > points below. I didn't adjust the docs per your comments yet, will do
> > that next.
>
> I have some doubts regarding the comments fixed by you and some more new
> review comments.
> After this I shall focus majorly towards testing of this Patch.
>

Testing
-----------

Failed Case
--------------
1. Promote the standby to master and make a standby follow the new master.
2. Stop the standby and the master. Restart the standby first, then the master.
3. Restarting the standby gives the errors below:
E:\pg_git_code\installation\bin>LOG: database system was shut down in recovery at 2012-10-04 18:36:00 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/176B800
LOG: redo starts at 0/176B800
LOG: record with zero length at 0/176BD68
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
FATAL: terminating walreceiver process due to administrator command
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0
LOG: out-of-sequence timeline ID 1 (after 2) in log segment 000000020000000000000001, offset 0

Once this error occurs, restarting the master and standby in any order, or
performing operations on the master, always produces the above error on the
standby.
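
To make the message easier to read: a WAL segment file name is 24 hex digits,
where the first 8 are the timeline ID, so segment 000000020000000000000001
belongs to timeline 2, while the page read from it claims timeline 1. Below is
a minimal sketch of that decoding and of the ordering rule the message
suggests. The function names are hypothetical and this is not the code in
xlog.c; the "lower than last seen" rule is my reading of the log text.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: split a 24-hex-digit WAL segment file name into
 * timeline ID, log id and segment number. Returns 1 on success, 0 if the
 * name is malformed.
 */
static int
parse_wal_segment_name(const char *name, uint32_t *tli,
                       uint32_t *log, uint32_t *seg)
{
    unsigned int t, l, s;

    if (strlen(name) != 24)
        return 0;
    if (strspn(name, "0123456789ABCDEFabcdef") != 24)
        return 0;
    if (sscanf(name, "%8X%8X%8X", &t, &l, &s) != 3)
        return 0;

    *tli = t;
    *log = l;
    *seg = s;
    return 1;
}

/*
 * Assumed rule: a page whose timeline ID is lower than one already seen
 * is out of sequence.
 */
static int
tli_out_of_sequence(uint32_t page_tli, uint32_t last_seen_tli)
{
    return page_tli < last_seen_tli;
}
```

With the values from the log above, the segment name decodes to timeline 2 and
a page timeline of 1 is flagged as out of sequence.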

Passed Cases
-------------
1. After promoting the standby as the new master, try to make the previous
master (having the same WAL as the new master) a standby.
In this case recovery_target_timeline in recovery.conf is set to latest. It
is able to connect to the new master and starts streaming as expected.
- As per expected behavior.

2. After promoting the standby as the new master, try to make the previous
master (having more WAL than the new master) a standby;
an error is displayed.
- As per expected behavior.

3. After promoting the standby as the new master, try to make the previous
master (having the same WAL as the new master) a standby.
In this case recovery_target_timeline is not set in recovery.conf. The
following LOG is displayed.
LOG: fetching timeline history file for timeline 2 from primary server
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
LOG: walreceiver ended streaming and awaits new instructions
LOG: re-handshaking at position 0/1000000 on tli 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1
- As per expected behavior
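
The three outcomes above can be summarized as a small decision function. This
is only a sketch of the behavior I observed, not of how the patch decides; the
enum, the function name, the flat 64-bit WAL positions and the order of the
checks are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the outcomes seen in passed cases 1 to 3. */
enum FollowOutcome
{
    FOLLOW_STREAMS,               /* case 1: connects and streams */
    FOLLOW_REJECTED,              /* case 2: error is displayed */
    FOLLOW_STUCK_ON_OLD_TIMELINE  /* case 3: keeps re-handshaking on tli 1 */
};

/*
 * old_master_end: last WAL position present on the previous master.
 * switch_point:   position at which the new master left the old timeline.
 * target_latest:  true if recovery_target_timeline is set to latest.
 */
static enum FollowOutcome
expected_follow_outcome(uint64_t old_master_end, uint64_t switch_point,
                        bool target_latest)
{
    /* The previous master has WAL that the new master never had. */
    if (old_master_end > switch_point)
        return FOLLOW_REJECTED;

    /* Without "latest" the standby stays on the old timeline. */
    if (!target_latest)
        return FOLLOW_STUCK_ON_OLD_TIMELINE;

    return FOLLOW_STREAMS;
}
```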

Pending cases that need to be tested (these are scenarios; I will do some
more testing based on them)
---------------------------------------
1. a. Master M-1
   b. Standby S-1 follows M-1
   c. Standby S-2 follows M-1
   d. Promote S-1 as master
   e. Try to make S-2 follow S-1 -- the operation should succeed.

2. a. Master M-1
   b. Standby S-1 follows M-1
   c. Stop S-1 and M-1
   d. Do PITR on M-1 twice. This is to increment the timeline on M-1.
   e. Try to make standby S-1 follow M-1 -- it should succeed.

3. a. Master M-1
   b. Standbys S-1 and S-2 follow M-1
   c. Standbys S-3 and S-4 follow S-1
   d. Promote the standby which has the most WAL.
   e. Make all standbys follow the new master.

4. a. Master M-1
   b. Synchronous standbys S-1 and S-2
   c. Promote S-1
   d. Make M-1 and S-2 follow S-1 -- this operation should succeed.

Concurrent Operations
---------------------------
1. a. Master M-1, Standby S-1 follows M-1, Standby S-2 follows M-1
   b. Many concurrent operations on master M-1
   c. During the concurrent operations, promote S-1
   d. Try to make S-2 follow S-1 -- it should happen successfully.

2. During promotion, call pg_basebackup

3. During promotion, try to connect a client

Resource Testing
------------------
1. a. Make the standby follow a master which is many timelines ahead
   b. Observe if there is any resource leak
   c. Allow streaming replication to run for 30 mins
   d. Observe if there is any resource leak

Code Review
-------------
libpqrcv_readtimelinehistoryfile()
{
..
..
+	if (PQnfields(res) != 2 || PQntuples(res) != 1)
+	{
+		int ntuples = PQntuples(res);
+		int nfields = PQnfields(res);
+
+		PQclear(res);
+		ereport(ERROR,
+				(errmsg("invalid response from primary server"),
+				 errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
+						   ntuples, nfields)));
+	}

..
}

The error message says that 3 fields are expected in the timeline history
response, but the check is done for 2 fields.
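
One way to keep the check and the message from drifting apart is to take both
from the same constant. The following is only a standalone sketch, without
libpq or ereport; the macro and function names are hypothetical, and the value
2 is simply copied from the check in the patch, since I do not know which of
the two numbers is the intended one.

```c
#include <stdio.h>

/* Hypothetical constants; 2 is taken from the condition in the patch. */
#define TLHISTORY_EXPECTED_TUPLES 1
#define TLHISTORY_EXPECTED_FIELDS 2

/*
 * Returns 1 if the result shape matches. Otherwise returns 0 and writes a
 * detail message built from the same constants the check uses.
 */
static int
check_tlhistory_result_shape(int ntuples, int nfields,
                             char *detail, size_t detail_len)
{
    if (nfields == TLHISTORY_EXPECTED_FIELDS &&
        ntuples == TLHISTORY_EXPECTED_TUPLES)
    {
        if (detail_len > 0)
            detail[0] = '\0';
        return 1;
    }

    snprintf(detail, detail_len,
             "Expected %d tuple with %d fields, got %d tuples with %d fields.",
             TLHISTORY_EXPECTED_TUPLES, TLHISTORY_EXPECTED_FIELDS,
             ntuples, nfields);
    return 0;
}
```

With this shape, changing the expected field count in one place changes both
the condition and the text reported to the user.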

Kindly let me know if you want me to focus on any other areas for testing
this feature.

With Regards,
Amit Kapila.
