Re: Switching timeline over streaming replication

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: 'PostgreSQL-development' <pgsql-hackers(at)postgreSQL(dot)org>, 'Thom Brown' <thom(at)linux(dot)com>
Subject: Re: Switching timeline over streaming replication
Date: 2012-12-05 19:22:53
Message-ID: 50BF9F0D.5040608@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05.12.2012 14:32, Amit Kapila wrote:
> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
>> After some diversions to fix bugs and refactor existing code, I've
>> committed a couple of small parts of this patch, which just add some
>> sanity checks to notice incorrect PITR scenarios. Here's a new version
>> of the main patch based on current HEAD.
>
> After testing with the new patch, the following problems are observed.
>
> Defect - 1:
>
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B.
> 4. start another standby D following C.
> 5. Promote standby B.
> 6. After successful time line switch in cascade standby C& D, stop D.
> 7. Restart D, Startup is successful and connecting to standby C.
> 8. Stop C.
> 9. Restart C, startup is failing.

Ok, the error I get in that scenario is:

C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not
contain minimum recovery point 0/3023F08 on timeline 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283)
exited with exit code 1
C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup
process failure

It seems that the commits I made to master already:

http://archives.postgresql.org/pgsql-committers/2012-12/msg00116.php
http://archives.postgresql.org/pgsql-committers/2012-12/msg00111.php

were a few bricks shy of a load. The problem is that if recovery stops
at a checkpoint record that changes timeline, so that minRecoveryPoint
points to the end of the checkpoint record, we still record the old TLI
as the TLI of minRecoveryPoint. This is because 1) there's a silly bug
in the patch; replayEndTLI is not updated along with replayEndRecPtr.
But even after fixing that, we're not good.

The problem is that replayEndRecPtr is currently updated *before*
calling the redo function. replayEndRecPtr is what becomes
minRecoveryPoint when XLogFlush() is called. If the record is a
checkpoint record, redoing it will switch recovery to the new timeline,
but replayEndTLI will not be updated until the next record.

IOW, as far as minRecoveryPoint is concerned, a checkpoint record that
switches timeline is considered to be part of the old timeline. But when
a server is promoted and a new timeline is created, the checkpoint
record is considered to be part of the new timeline; that's what we
write in the page header and in the control file.

That mismatch causes the error. I'd like to fix this by always treating
the checkpoint record to be part of the new timeline. That feels more
correct. The most straightforward way to implement that would be to peek
at the xlog record before updating replayEndRecPtr and replayEndTLI. If
it's a checkpoint record that changes TLI, set replayEndTLI to the new
timeline before calling the redo-function. But it's a bit of a
modularity violation to peek into the record like that.

Or we could just revert the sanity check at beginning of recovery that
throws the "requested timeline 2 does not contain minimum recovery point
0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
record that says "unexpected timeline ID %u in checkpoint record, before
reaching minimum recovery point %X/%X on timeline %u" checks basically
the same thing, but at a later stage. However, the way
minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
fix that.

I'm thinking of something like the attached (with some more comments
before committing). Thoughts?

- Heikki

Attachment Content-Type Size
fix-minrecoverypointtli.patch text/x-diff 1.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2012-12-05 19:32:22 Re: json accessors
Previous Message Andres Freund 2012-12-05 19:19:27 Re: Dumping an Extension's Script