Re: Fast promotion failure

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast promotion failure
Date: 2013-05-19 14:25:08
Message-ID: CA+U5nMKp9KvhKTCEgQK2MMRU8do9HJBj-jhorfNpsAepApzQEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7 May 2013 10:57, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> While testing the bug from the "Assertion failure at standby promotion", I
> bumped into a different bug in fast promotion. When the first checkpoint
> after fast promotion is performed, there is no guarantee that the
> checkpointer process is running with the correct, new, ThisTimeLineID. In
> CreateCheckPoint(), we have this:
>
>> /*
>> * An end-of-recovery checkpoint is created before anyone is
>> allowed to
>> * write WAL. To allow us to write the checkpoint record,
>> temporarily
>> * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
>> * initialized, which we need here and in AdvanceXLInsertBuffer.)
>> */
>> if (flags & CHECKPOINT_END_OF_RECOVERY)
>> LocalSetXLogInsertAllowed();
>
>
> That ensures that ThisTimeLineID is updated when performing an
> end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
> The consequence is that the checkpoint is created with the old timeline, and
> subsequent recovery from it will fail.

LocalSetXLogInsertAllowed() is called by CreateEndOfRecoveryRecord(),
but in fast promotion this is called by Startup process, not
Checkpointer process. So there is no longer an explicit call to
InitXLOGAccess() in the checkpointer process via a checkpoint with
flag CHECKPOINT_END_OF_RECOVERY.

However, there is a call to RecoveryInProgress() at the top of the
main loop of the checkpointer, which does explicitly state that it
"initializes TimeLineID if it's not set yet." The checkpointer makes
the decision about whether to run a restartpoint or a checkpoint
directly from the answer to that, modified only in the case of an
CHECKPOINT_END_OF_RECOVERY.

So the appropiate call is made and I don't agree with the statement
that this "doesn't get executed with fast promotion", or the title of
the thread.

So while I believe that the checkpointer might have an incorrect TLI
and that you've seen a bug, what isn't clear is that the checkpointer
is the only process that would see an incorrect TLI, or why such
processes see an incorrect TLI. It seems equally likely at this point
that the TLI may be set incorrectly somehow and that is why it is
being read incorrectly.

I see that the comment in InitXLOGAccess() is incorrect
"ThisTimeLineID doesn't change so we need no lock to copy it", which
seems worth correcting since there's no need to save time in a once
only function.

Continuing to investigate further.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2013-05-19 16:35:22 Re: Assertion failure when promoting node by deleting recovery.conf and restart node
Previous Message Simon Riggs 2013-05-19 14:22:49 Re: fast promotion and log_checkpoints