Re: Switching timeline over streaming replication

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'PostgreSQL-development'" <pgsql-hackers(at)postgreSQL(dot)org>, "'Thom Brown'" <thom(at)linux(dot)com>
Subject: Re: Switching timeline over streaming replication
Date: 2012-12-10 12:46:44
Message-ID: 011501cdd6d4$69554860$3bffd920$@kapila@huawei.com
Lists: pgsql-hackers

> From: Heikki Linnakangas [mailto:hlinnakangas(at)vmware(dot)com]
> Sent: Friday, December 07, 2012 9:22 PM
> To: Amit Kapila
> Cc: 'PostgreSQL-development'; 'Thom Brown'
> Subject: Re: [HACKERS] Switching timeline over streaming replication
>
> On 06.12.2012 15:39, Amit Kapila wrote:
> > On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
> >> On 05.12.2012 14:32, Amit Kapila wrote:
> >>> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
> >>>> After some diversions to fix bugs and refactor existing code, I've
> >>>> committed a couple of small parts of this patch, which just add
> >>>> some sanity checks to notice incorrect PITR scenarios. Here's a new
> >>>> version of the main patch based on current HEAD.
> >>>
> >>> After testing with the new patch, the following problems are observed.
> >>>
> >>> Defect - 1:
> >>>
> >>> 1. start primary A
> >>> 2. start standby B following A
> >>> 3. start cascade standby C following B.
> >>> 4. start another standby D following C.
> >>> 5. Promote standby B.
> >>> 6. After successful time line switch in cascade standby C & D, stop D.
> >>> 7. Restart D, Startup is successful and connecting to standby C.
> >>> 8. Stop C.
> >>> 9. Restart C, startup is failing.
> >>
> >> Ok, the error I get in that scenario is:
> >>
> >> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
> >> C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
> >> C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure
> >>
> >
> >>
> Well, it seems wrong for the control file to contain a situation like
> this:
>
> pg_control version number: 932
> Catalog version number: 201211281
> Database system identifier: 5819228770976387006
> Database cluster state: shut down in recovery
> pg_control last modified: pe 7. joulukuuta 2012 17.39.57
> Latest checkpoint location: 0/3023EA8
> Prior checkpoint location: 0/2000060
> Latest checkpoint's REDO location: 0/3023EA8
> Latest checkpoint's REDO WAL file: 000000020000000000000003
> Latest checkpoint's TimeLineID: 2
> ...
> Time of latest checkpoint: pe 7. joulukuuta 2012 17.39.49
> Min recovery ending location: 0/3023F08
> Min recovery ending loc's timeline: 1
>
> Note the latest checkpoint location and its TimelineID, and compare them
> with the min recovery ending location. The min recovery ending location
> is ahead of latest checkpoint's location; the min recovery ending
> location actually points to the end of the checkpoint record. But how
> come the min recovery ending location's timeline is 1, while the
> checkpoint record's timeline is 2?
>
> Now maybe that would happen to work if we remove the sanity check, but it
> still seems horribly confusing. I'm afraid that discrepancy will come
> back to haunt us later if we leave it like that. So I'd like to fix
> that.
>
> After mulling this over some more, I propose the attached patch. With the
> patch, we peek into the checkpoint record, and actually perform the
> timeline switch (by changing ThisTimeLineID) before replaying it. That
> way the checkpoint record is really considered to be on the new timeline
> for all purposes. At the moment, the only difference that makes in
> practice is that we set replayEndTLI, and thus minRecoveryPointTLI, to
> the new TLI, but it feels logically more correct to do it that way.

This fixes both of the problems reported at the link below:
http://archives.postgresql.org/pgsql-hackers/2012-12/msg00267.php

The code is also fine.
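
For readers following the thread, the check that produced the FATAL error
can be modelled roughly as below. This is only a simplified sketch in
Python, not the actual xlog.c code: the function names (parse_lsn,
tli_of_point, min_recovery_point_ok) are made up for illustration, and the
switch point of 0/3023EA8 between timelines 1 and 2 is an assumption taken
from the checkpoint location in the pg_controldata output quoted above.

```python
from typing import List, NamedTuple, Optional


class TimelineEntry(NamedTuple):
    """One span of a timeline history (hypothetical, simplified)."""
    tli: int
    begin: int            # first WAL position belonging to this timeline
    end: Optional[int]    # first position NOT on it; None means open-ended


def parse_lsn(text: str) -> int:
    """Convert an 'X/Y' WAL location such as '0/3023F08' to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def tli_of_point(lsn: int, history: List[TimelineEntry]) -> int:
    """Return the timeline that owns the given WAL position."""
    for entry in history:
        if entry.begin <= lsn and (entry.end is None or lsn < entry.end):
            return entry.tli
    raise ValueError("WAL position is not covered by the timeline history")


def min_recovery_point_ok(min_lsn: int, min_tli: int,
                          history: List[TimelineEntry]) -> bool:
    """Model of the sanity check: the byte just before the min recovery
    point must lie on the timeline recorded alongside it."""
    return tli_of_point(min_lsn - 1, history) == min_tli


# ASSUMPTION: timeline 1 ends, and timeline 2 begins, at 0/3023EA8.
SWITCH_POINT = parse_lsn("0/3023EA8")
HISTORY = [
    TimelineEntry(tli=1, begin=0, end=SWITCH_POINT),
    TimelineEntry(tli=2, begin=SWITCH_POINT, end=None),
]

MIN_RECOVERY_POINT = parse_lsn("0/3023F08")

# Control file as quoted above: min recovery point tagged with timeline 1.
before_patch = min_recovery_point_ok(MIN_RECOVERY_POINT, 1, HISTORY)

# With the checkpoint record treated as being on the new timeline, the
# min recovery point is tagged with timeline 2 instead.
after_patch = min_recovery_point_ok(MIN_RECOVERY_POINT, 2, HISTORY)
```

Under these assumptions before_patch is False, which corresponds to the
startup failure on C, and after_patch is True.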

With Regards,
Amit Kapila.
