Re: BLK_DONE state in XLogReadBufferForRedoExtended

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BLK_DONE state in XLogReadBufferForRedoExtended
Date: 2017-10-16 12:50:33
Message-ID: CAA4eK1Jeafx-FcMdVUHXDFU4T_X1_R0gqCi+n7JA6HTq2rR=rA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 13, 2017 at 11:57 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Oct 13, 2017 at 10:32 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Thu, Oct 12, 2017 at 10:47 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> Today, I was trying to think about cases when we can return BLK_DONE
>>> in XLogReadBufferForRedoExtended. One thing that occurred to me is
>>> that it can happen during the replay of WAL if the full_page_writes is
>>> off. Basically, when the full_page_writes is on, then during crash
>>> recovery, it will always first restore the full page image of page and
>>> then apply the WAL corresponding to that page, so we will never hit
>>> the case where the lsn of the page is greater than lsn of WAL record.
>>>
>>> Are there other cases for which we can get BLK_DONE state? Is it
>>> mentioned somewhere in code/comments which I am missing?
>>
>> Remember the thread about meta page optimization... Some index AMs use
>> XLogInitBufferForRedo() to redo their meta pages and those don't have
>> a FPW so when doing crash recovery you may finish by not replaying a
>> meta page if its LSN on the page header is newer than what's being
>> replayed.
>>
>
> I think for metapage usage, it will anyway apply the changes. See
> _bt_restore_page.
>
>> That's another case where BLK_DONE can be reached, even if
>> full_page_writes is on.
>>
>
> Yeah and probably if someone uses REGBUF_NO_IMAGE. However, I was
> mainly thinking about cases where caller is not doing anything to
> prevent full_page_image explicitly.
>
>

If the above analysis is correct, then I think we can say that, when
full_page_writes is enabled, the row state in a page during recovery
will be the same as it was when the original operation was performed.
I am not sure how much this helps with the current heap format, but it
can help in zheap (undo-based heap).
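
To make the case analysis concrete, here is a rough sketch of the
decision being discussed. It is a simplified model, not the actual
xlogutils.c code: the Sketch* type names and sketch_redo_action() are
made up for illustration, and it ignores details such as missing
blocks and buffer locking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for an LSN; hypothetical name, not the real type. */
typedef uint64_t SketchLSN;

/* Hypothetical mirror of the outcomes discussed in this thread. */
typedef enum
{
    SKETCH_BLK_NEEDS_REDO,      /* page is older than the record: apply it */
    SKETCH_BLK_DONE,            /* page already reflects the record: skip */
    SKETCH_BLK_RESTORED         /* full-page image was restored */
} SketchRedoAction;

/*
 * Decide what replay does for one block reference (assumed model).
 *
 * If the record carries a full-page image, the image is restored and the
 * page LSN is never compared.  Otherwise the change is skipped when the
 * page LSN is already at or past the end of the record.
 */
static SketchRedoAction
sketch_redo_action(bool record_has_fpi,
                   SketchLSN record_end_lsn,
                   SketchLSN page_lsn)
{
    if (record_has_fpi)
        return SKETCH_BLK_RESTORED;
    if (record_end_lsn <= page_lsn)
        return SKETCH_BLK_DONE;
    return SKETCH_BLK_NEEDS_REDO;
}
```

In this model, a record without a full-page image only yields the
"done" outcome when the page on disk is already newer than the record,
which is the full_page_writes = off case (or REGBUF_NO_IMAGE) mentioned
above.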

In zheap, we write the complete tuple to undo for a Delete operation
so that we can reclaim the corresponding tuple space as soon as the
deleting transaction is committed. Now, during recovery, we have to
regenerate the complete undo record (which includes the entire tuple).
Ideally, we would write the complete tuple in WAL for that, but
instead I think we can regenerate it from the original page. This is
only applicable when full_page_writes is enabled; otherwise, the
complete tuple is required in WAL.
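
Roughly, the choice during replay of a Delete would look like the
sketch below. This is only an illustration of the idea, not zheap
code: SketchTuple and sketch_tuple_for_undo() are hypothetical names,
and it assumes the page contents at replay time match those at the
time of the original operation whenever full_page_writes is on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal tuple representation for illustration only. */
typedef struct
{
    const char *data;
    size_t      len;
} SketchTuple;

/*
 * Pick the source of the deleted tuple when rebuilding the undo record
 * during replay (assumed model).
 *
 * With full_page_writes on, the tuple still on the page is taken to be
 * identical to the one seen by the original Delete, so it can be copied
 * from there.  Otherwise the WAL record itself must carry the tuple; the
 * result is NULL if it does not.
 */
static const SketchTuple *
sketch_tuple_for_undo(bool full_page_writes,
                      const SketchTuple *tuple_on_page,
                      const SketchTuple *tuple_in_wal)
{
    if (full_page_writes)
        return tuple_on_page;
    return tuple_in_wal;
}
```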

I am not sure how much sense the above makes to anyone without a
detailed explanation, but I thought I should give some background on
why I asked this question. If anybody needs more explanation or sees
any fault in the above understanding, please let me know.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
