Re: Switching timeline over streaming replication

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'PostgreSQL-development'" <pgsql-hackers(at)postgreSQL(dot)org>, "'Thom Brown'" <thom(at)linux(dot)com>
Subject: Re: Switching timeline over streaming replication
Date: 2012-12-06 13:39:59
Message-ID: 00e101cdd3b7$2ff195d0$8fd4c170$@kapila@huawei.com
Lists: pgsql-hackers

On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
> On 05.12.2012 14:32, Amit Kapila wrote:
> > On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
> >> After some diversions to fix bugs and refactor existing code, I've
> >> committed a couple of small parts of this patch, which just add some
> >> sanity checks to notice incorrect PITR scenarios. Here's a new
> >> version of the main patch based on current HEAD.
> >
> > After testing with the new patch, the following problems are observed.
> >
> > Defect - 1:
> >
> > 1. start primary A
> > 2. start standby B following A
> > 3. start cascade standby C following B.
> > 4. start another standby D following C.
> > 5. Promote standby B.
> > 6. After a successful timeline switch in cascade standbys C and D, stop D.
> > 7. Restart D; startup succeeds and it connects to standby C.
> > 8. Stop C.
> > 9. Restart C; startup fails.
>
> Ok, the error I get in that scenario is:
>
> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
> C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure
>

>
> That mismatch causes the error. I'd like to fix this by always treating
> the checkpoint record to be part of the new timeline. That feels more
> correct. The most straightforward way to implement that would be to peek
> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
> it's a checkpoint record that changes TLI, set replayEndTLI to the new
> timeline before calling the redo-function. But it's a bit of a
> modularity violation to peek into the record like that.
>
> Or we could just revert the sanity check at beginning of recovery that
> throws the "requested timeline 2 does not contain minimum recovery point
> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
> record that says "unexpected timeline ID %u in checkpoint record, before
> reaching minimum recovery point %X/%X on timeline %u" checks basically
> the same thing, but at a later stage. However, the way
> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
> fix that.
>
> I'm thinking of something like the attached (with some more comments
> before committing). Thoughts?

This fixes the reported problem.
However, I cannot work out whether there would be any problem if we removed
the check "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" at the beginning of recovery and simply updated
replayEndTLI with ThisTimeLineID. Can you think of any?
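
For reference, the first alternative quoted above (peek at the record before
updating replayEndRecPtr and replayEndTLI) could be sketched roughly as
below. This is only an illustration and not the attached patch: the
SketchRecord and ReplayState types and the sketch_advance_replay_end()
function are hypothetical stand-ins, and only the names replayEndRecPtr and
replayEndTLI come from the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* Hypothetical, simplified record; NOT the real XLogRecord layout. */
typedef struct SketchRecord
{
    bool        is_checkpoint;   /* record is a checkpoint record */
    TimeLineID  checkpoint_tli;  /* timeline stored in the checkpoint */
    XLogRecPtr  end_ptr;         /* position just past this record */
} SketchRecord;

/* Hypothetical holder for the two values discussed in this thread. */
typedef struct ReplayState
{
    XLogRecPtr  replayEndRecPtr;
    TimeLineID  replayEndTLI;
} ReplayState;

/*
 * Called before the redo function for 'rec'.  If the record is a
 * checkpoint that changes the timeline, treat the record itself as
 * part of the new timeline; otherwise keep the current one.
 */
static void
sketch_advance_replay_end(ReplayState *state, TimeLineID current_tli,
                          const SketchRecord *rec)
{
    TimeLineID  tli = current_tli;

    if (rec->is_checkpoint && rec->checkpoint_tli != current_tli)
        tli = rec->checkpoint_tli;

    state->replayEndRecPtr = rec->end_ptr;
    state->replayEndTLI = tli;
}
```

With this shape, replayEndTLI already names the new timeline by the time a
timeline-switching checkpoint record is replayed, which is the behaviour
described as "always treating the checkpoint record to be part of the new
timeline".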

With Regards,
Amit Kapila.
