Re: [BUG?] lag of minRecoveryPont in archive recovery

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUG?] lag of minRecoveryPont in archive recovery
Date: 2012-12-11 16:37:02
Message-ID: 50C7612E.5060008@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11.12.2012 17:04, Fujii Masao wrote:
> On Tue, Dec 11, 2012 at 6:14 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
>>>>
>>>> The cause is that CheckRecoveryConsistency() is called before rm_redo(),
>>>> as Horiguchi-san suggested upthead. Imagine the case where
>>>> minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
>>>> record. When restarting the server with that minRecoveryPoint,
>>>> the followings would happen, and then PANIC occurs.
>>>>
>>>> 1. XLOG_SMGR_TRUNCATE record is read.
>>>> 2. CheckRecoveryConsistency() is called, and database is marked as
>>>> consistent since we've reached minRecoveryPoint (i.e., the location
>>>> of XLOG_SMGR_TRUNCATE).
>>>> 3. XLOG_SMGR_TRUNCATE record is replayed, and invalid page is
>>>> found.
>>>> 4. Since the database has already been marked as consistent, an invalid
>>>> page leads to PANIC.
>>>
>>>
>>> Exactly.
>>>
>>> In smgr_redo, EndRecPtr which is pointing the record next to
>>> SMGR_TRUNCATE, is used as the new minRecoveryPoint. On the other
>>> hand, during the second startup of the standby,
>>> CheckRecoveryConsistency checks for consistency by
>>> XLByteLE(minRecoveryPoint, EndRecPtr) which should be true at
>>> just BEFORE the SMGR_TRUNCATE record is applied. So
>>> reachedConsistency becomes true just before the SMGR_TRUNCATE
>>> record will be applied. Bang!
>>
>>
>> Ah, I see. I thought moving CheckRecoveryConsistency was just a nice-to-have
>> thing, so that we'd notice that we're consistent earlier, so didn't pay
>> attention to that part.
>>
>> I think the real issue here is that CheckRecoveryConsistency should not be
>> comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as it
>> currently does. EndRecPtr points to the end of the last record *read*, while
>> recoveryLastRecPtr points to the end of the last record *replayed*. The
>> latter is what CheckRecoveryConsistency should be concerned with.
>>
>> BTW, it occurs to me that we have two variables in the shared struct that
>> are almost the same thing: recoveryLastRecPtr and replayEndRecPtr. The
>> former points to the end of the last record successfully replayed, while
>> replayEndRecPtr points to the end of the last record successfully replayed,
>> or begin replayed at the moment. I find that confusing, so I suggest that we
>> rename recoveryLastRecPtr to make that more clear. Included in the attached
>> patch.
>
> The patch looks good. I confirmed that the PANIC error didn't happen,
> with the patch.

Thanks. After thinking about this more, I think it's still not quite
right. Now that we fix the check in CheckRecoveryConsistency, we have to
move the call to CheckRecoveryConsistency to after the rm_redo call.
Otherwise you don't enter hot standby mode after replaying the last
record, the one ending at minRecoveryPoint. You have to wait for one
more record to be read (but not replayed), so that
CheckRecoveryConsistency gets called and we let backends in.

As the code stands, the bad placement of CheckRecoveryConsistency is
compensated by the bug that we open up for hot standby one record too
early. The net result is that we open up for hot standby just before
replaying the last record that makes us consistent. The window for
someone to see the inconsistent state is really small, postmaster will
have to receive the signal, start accepting connections, and you have
connect and run a query, all before the startup process gets around to
replay the record its already read into memory.

- Heikki

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2012-12-11 17:03:47 Re: [BUG?] lag of minRecoveryPont in archive recovery
Previous Message Andres Freund 2012-12-11 15:44:36 Re: Re: [PATCH 02/14] Add support for a generic wal reading facility dubbed XLogReader