Re: Avoiding shutdown checkpoint at failover

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding shutdown checkpoint at failover
Date: 2012-01-26 05:27:48
Message-ID: CAHGQGwH2rOZjMa_-iPCB=X6=5LbLxSf45o5SSR04YDkJccDz8g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2012 at 12:33 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Jan 18, 2012 at 7:15 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Sun, Nov 13, 2011 at 5:13 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> On Tue, Nov 1, 2011 at 12:11 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>
>>>> When I say skip the shutdown checkpoint, I mean remove it from the
>>>> critical path of required actions at the end of recovery. We can still
>>>> have a normal checkpoint kicked off at that time, but that no longer
>>>> needs to be on the critical path.
>>>>
>>>> Any problems foreseen? If not, looks like a quick patch.
>>>
>>> Patch attached for discussion/review.
>>
>> This feature is what I want, and very helpful to shorten the failover time in
>> streaming replication.
>>
>> Here are the review comments, though I've not yet checked thoroughly
>> whether this feature works in all recovery patterns.
>>
>> LocalSetXLogInsertAllowed() must be called before LogEndOfRecovery().
>> LocalXLogInsertAllowed must be set to -1 after LogEndOfRecovery().
>>
>> XLOG_END_OF_RECOVERY record is written to the WAL file with the newly
>> assigned timeline ID. But it must be written to the WAL file with the old one.
>> Otherwise, when re-entering a recovery after failover, we cannot find
>> XLOG_END_OF_RECOVERY record at all.
>>
>> Before XLOG_END_OF_RECOVERY record is written,
>> RmgrTable[rmid].rm_cleanup() might write WAL records. They also
>> should be written to the WAL file with old timeline ID.
>>
>> When recovery target is specified, we cannot write new WAL to the file
>> with old timeline because that would mean that valid WAL records in it are
>> overwritten with new WAL. So when recovery target is specified,
>> ISTM that we cannot skip end of recovery checkpoint. Or we might need
>> to save all information about timelines in the database cluster instead
>> of writing XLOG_END_OF_RECOVERY record, and use it when re-entering
>> a recovery.
>>
>> LogEndOfRecovery() seems to need to call XLogFlush(). Otherwise,
>> what if the server crashes after new timeline history file is created and
>> recovery.conf is removed, but before the XLOG_END_OF_RECOVERY record
>> has been flushed to the disk?
>>
>> During recovery, when we replay XLOG_END_OF_RECOVERY record, we
>> should close the currently-opened WAL file and read the WAL file with
>> the timeline which XLOG_END_OF_RECOVERY record indicates.
>> Otherwise, when re-entering a recovery with old timeline, we cannot
>> reach new timeline.
>
> OK, some bad things there, thanks for the insightful comments.
>
> I think you're right that we can't skip the checkpoint if xlog_cleanup
> writes WAL records, since that implies at least one and maybe more
> blocks have changed and need to be flushed. That can be improved upon,
> but not now in 9.2. Cleanup WAL is written in either the old or the new
> timeline, depending upon whether we increment it. So we don't need to
> change anything there, IMHO.
>
> The big problem is how we handle crash recovery after we start up
> without a checkpoint. No quick fixes there.
>
> So let me rethink this: The idea was that we can skip the checkpoint
> if we promote to normal running during streaming replication.
>
> WALReceiver has been writing to WAL files, so can write more data
> without all of the problems noted. Rather than write the
> XLOG_END_OF_RECOVERY record via XLogInsert we should write that **from
> the WALreceiver** as a dummy record by direct injection into the WAL
> stream. So the Startup process sees a WAL record that looks like it
> was written by the primary saying "promote yourself", although it was
> actually written locally by WALreceiver when requested to shutdown.
> That doesn't damage anything because we know we've received all the
> WAL there is. Most importantly we don't need to change any of the
> logic in a way that endangers the other code paths at end of recovery.
>
> Writing the record in that way means we would need to calculate the
> new tli slightly earlier, so we can input the correct value into the
> record. That also solves the problem of how to get additional standbys
> to follow the new master. The XLOG_END_OF_RECOVERY record is simply
> the contents of the newly written tli history file.
>
> If we skip the checkpoint and then crash before the next checkpoint we
> just change timeline when we see XLOG_END_OF_RECOVERY. When we replay
> the XLOG_END_OF_RECOVERY we copy the contents to the appropriate tli
> file and then switch to it.
>
> So this solves 2 problems: having other standbys follow us when they
> don't have archiving, and avoids the checkpoint.
>
> Let me know what you think.

Looks good to me.

One thing I would like to ask is why you think walreceiver is more
appropriate than the startup process for writing the XLOG_END_OF_RECOVERY
record. I was thinking the opposite, because if the startup process writes
it, we might be able to skip the end-of-recovery checkpoint even in the
file-based log-shipping case.
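
To make the replay side of the discussion concrete, here is a toy sketch
in C. It is not PostgreSQL code: ToyRecord, ToyRecKind and toy_replay()
are made-up names, and the model assumes that replay reads records only
from the timeline it is currently on and switches when it sees an
end-of-recovery record. Under those assumptions it shows why the record
has to sit in the old timeline's WAL, whichever process writes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kinds; these are not PostgreSQL's real definitions. */
typedef enum { REC_ORDINARY, REC_END_OF_RECOVERY } ToyRecKind;

typedef struct {
    ToyRecKind kind;
    uint32_t   tli;      /* timeline of the WAL file holding the record */
    uint32_t   new_tli;  /* target timeline; used only by REC_END_OF_RECOVERY */
} ToyRecord;

/*
 * Replay recs[0..n-1] starting on start_tli.
 * Returns the timeline replay ends on and stores the number of records
 * applied in *applied (if applied is not NULL).
 */
uint32_t
toy_replay(const ToyRecord *recs, size_t n, uint32_t start_tli, size_t *applied)
{
    uint32_t cur = start_tli;
    size_t   count = 0;

    for (size_t i = 0; i < n; i++) {
        /* A record on another timeline is invisible to this replay. */
        if (recs[i].tli != cur)
            break;

        count++;

        if (recs[i].kind == REC_END_OF_RECOVERY) {
            /* Assumption: timeline IDs only ever increase. */
            assert(recs[i].new_tli > cur);
            cur = recs[i].new_tli;   /* continue reading on the new timeline */
        }
    }

    if (applied != NULL)
        *applied = count;
    return cur;
}
```

With the record placed on the old timeline, replay that starts there
reaches the new timeline and keeps going. If the same record is stamped
with the new timeline instead, replay stops in front of it and never
leaves the old timeline.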

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
