Re: Skip checkpoint on promoting from streaming replication

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: masao(dot)fujii(at)gmail(dot)com
Cc: simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-19 08:30:46
Message-ID: 20120619.173046.88698848.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you.

> What happens if the server skips an end-of-recovery checkpoint,
> is promoted to the master, runs some write transactions,
> crashes and restarts automatically before it completes
> checkpoint? In this case, the server needs to do crash recovery
> from the last checkpoint record with old timeline ID to the
> latest WAL record with new timeline ID. How does crash recovery
> do recovery beyond timeline?

Basically the same as archive recovery as far as I saw. It is
already implemented to work in that way.

After this patch applied, StartupXLOG() gets its
recoveryTargetTLI from the new field lastestTLI in the control
file instead of the latest checkpoint. And the latest checkpoint
record informs its TLI and WAL location as before, but its TLI
does not have a significant meaning in the recovery sequence.

Suggest the case following,

|seg 1 | seg 2 |
TLI 1 |...c......|....000000|
C P X
TLI 2 |........00|

* C - checkpoint, P - promotion, X - crash just after here

This shows the situation that the latest checkpoint(restartpoint)
has been taken at TLI=1/SEG=1/OFF=4 and promoted at
TLI=1/SEG=2/OFF=5, then crashed just after TLI=2/SEG=2/OFF=8.
Promotion itself inserts no wal records but creates a copy of the
current segment for new TLI. the file for TLI=2/SEG=1 should not
exist. (Who will create it?)

The control file will looks as follows

latest checkpoint : TLI=1/SEG=1/OFF=4
latest TLI : 2

So the crash recovery sequence starts from SEG=1/LOC=4.
expectedTLIs will be (2, 1) so 1 will naturally be selected as
the TLI for SEG1 and 2 for SEG2 in XLogFileReadAnyTLI().

In the closer view, startup constructs expectedTLIs reading the
timeline hisotry file corresponds to the recoveryTargetTLI. Then
runs the recovery sequence from the redo point of the latest
checkpoint using WALs with the largest TLI - which is
distinguised by its file name, not header - within the
expectedTLIs in XLogPageRead(). The only difference to archive
recovery is XLogFileReadAnyTLI() reads only the WAL files already
sit in pg_xlog directory, and not reaches archive. The pages with
the new TLI will be naturally picked up as mentioned above in
this sequence and then will stop at the last readable record.

latestTLI field in the control file is updated just after the TLI
was incremented and the new WAL files with the new TLI was
created. So the crash recovery sequence won't stop before
reaching the WAL with new TLI disignated in the control file.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2012-06-19 08:31:09 Re: Pg default's verbosity?
Previous Message Heikki Linnakangas 2012-06-19 08:14:08 Re: WAL format changes