Re: Switching timeline over streaming replication

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'PostgreSQL-development'" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Switching timeline over streaming replication
Date: 2012-11-16 14:01:23
Message-ID: 00bd01cdc402$dd1c4ad0$9754e070$@kapila@huawei.com
Lists: pgsql-hackers

On Thursday, November 15, 2012 6:05 PM Heikki Linnakangas wrote:
> On 15.11.2012 12:44, Heikki Linnakangas wrote:
> > Here's an updated version of this patch, rebased with master,
> > including the recent replication timeout changes, and some other
> cleanup.
> >
> > On 12.10.2012 09:34, Amit Kapila wrote:
> >> The test is finished from myside.
> >>
> >> one more issue:
> > > ...
> >> ./pg_basebackup -P -D ../../data_sub -X fetch -p 2303
> >> pg_basebackup: COPY stream ended before last file was finished
> >
> > Fixed this.
> >
> > However, the test scenario you point to here:
> > http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e77
> > 10$(at)kapila@huawei.com still seems to be broken, although I get a
> > different error message now.
> > I'll dig into this..
>
> Ok, here's an updated patch again, with that bug fixed.

I started by testing this patch.

Basic stuff:
------------
- Patch applies OK
- Compiles cleanly with no warnings
- Regression tests pass except the "standbycheck".

At a glance, the "standbycheck" regression failures appear to be because the
SQL scripts and expected outputs are somewhat out of date.

The following problems were observed while testing the patch.
Defect-1:

1. start primary A
2. start standby B following A
3. start cascade standby C following B.
4. Promote standby B.
5. After a successful timeline switch in cascade standby C, stop C.
6. Restart C; startup fails with the following error.

LOG: database system was shut down in recovery at 2012-11-16 16:26:29 IST
FATAL: requested timeline 2 does not contain minimum recovery point 0/30143A0 on timeline 1
LOG: startup process (PID 415) exited with exit code 1
LOG: aborting startup due to startup process failure

The above defect was already discussed at the following link:
http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila(at)huawei(dot)com
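
For readers of the FATAL message in Defect-1: an LSN such as 0/30143A0 is
written as two hexadecimal halves of a 64-bit WAL position, and the message
says that this position, recorded as belonging to timeline 1, is not found on
timeline 1 within the history of the requested timeline 2. The following is
only a simplified Python sketch of that kind of check, not the server's code,
and I have not confirmed that it matches what actually happens in Defect-1.
The helper names (parse_lsn, timeline_of, contains_recovery_point) and the
switch point 0/3000000 are hypothetical, chosen for illustration.

```python
def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hex (e.g. '0/30143A0') to an int."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def timeline_of(lsn, history):
    """Return the timeline that owns `lsn` according to `history`.

    `history` is a list of (timeline_id, end_lsn) pairs in ascending order.
    end_lsn is None for the newest timeline, which has no end yet.
    """
    for tli, end in history:
        if end is None or lsn < end:
            return tli
    raise ValueError("LSN is not covered by the history")


def contains_recovery_point(history, min_recovery_lsn, min_recovery_tli):
    """True if the recorded recovery point lies on the timeline it claims."""
    return timeline_of(min_recovery_lsn, history) == min_recovery_tli


# Hypothetical history of timeline 2: timeline 1 ends at the (made-up)
# switch point 0/3000000, after which timeline 2 takes over.
history_tli2 = [(1, parse_lsn("0/3000000")), (2, None)]

# A recovery point at 0/30143A0 that is still labelled as timeline 1 lies
# past the made-up switch point, so the check fails.
ok = contains_recovery_point(history_tli2, parse_lsn("0/30143A0"), 1)
print(ok)
```

With these made-up numbers the check fails because the position is past the
point where timeline 1 ends, yet it is still labelled as timeline 1. That is
one way such a message could arise.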

Defect-2:

1. start primary A
2. start standby B following A
3. start cascade standby C following B, with the 'recovery_target_timeline'
option in recovery.conf disabled.
4. Promote standby B.
5. Cascade standby C is not able to follow the new master B because of the
timeline difference.
6. Try to stop cascade standby C. The stop fails and the server does not
shut down; the WAL receiver process is still running and clients are not
allowed to connect.

Defect-2 happened only once in my test environment; I will try to
reproduce it.
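
For anyone trying to reproduce Defect-2, the relevant setup detail is the
recovery.conf of cascade standby C in step 3. Below is a rough Python sketch
that renders such a file. The parameter names standby_mode, primary_conninfo
and recovery_target_timeline are the usual recovery.conf settings; the helper
name, the host and the port are hypothetical, and treating "disabled" as
"commented out" is my assumption about the setup.

```python
def render_recovery_conf(primary_host, primary_port, follow_latest_timeline):
    """Build recovery.conf text for a standby (illustrative helper only)."""
    lines = [
        "standby_mode = 'on'",
        "primary_conninfo = 'host=%s port=%d'" % (primary_host, primary_port),
    ]
    if follow_latest_timeline:
        # Standby follows a timeline switch on its upstream server.
        lines.append("recovery_target_timeline = 'latest'")
    else:
        # Assumed meaning of "disabled" in step 3: the option is commented out.
        lines.append("#recovery_target_timeline = 'latest'")
    return "\n".join(lines) + "\n"


# Hypothetical connection details for standby C pointing at standby B.
conf_for_c = render_recovery_conf("localhost", 2303, follow_latest_timeline=False)
print(conf_for_c)
```

Writing the returned text to recovery.conf in C's data directory before step 3
would give the Defect-2 configuration under these assumptions.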

With Regards,
Amit Kapila.
