Re: PITR promote bug: Checkpointer writes to older timeline

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, soumyadeep2007(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org, michael(at)paquier(dot)xyz, jyih(at)vmware(dot)com, kyeap(at)vmware(dot)com
Subject: Re: PITR promote bug: Checkpointer writes to older timeline
Date: 2021-03-04 05:57:13
Message-ID: ddc31aa9-8083-58b7-0b47-e371cd4c705b@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2021/03/04 10:28, Kyotaro Horiguchi wrote:
> At Wed, 3 Mar 2021 14:56:25 -0800, Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com> wrote in
>> On 2021/03/03 17:46, Heikki Linnakangas wrote:
>>
>>> I think it should be reset even earlier, inside XlogReadTwoPhaseData()
>>> probably. With your patch, doesn't the LogStandbySnapshot() call just
>>> above where you're ressetting ThisTimeLineID also write a WAL record
>>> with incorrect timeline?
>>
>> Agreed.
>
> Right.
>
>> On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>
>>>> Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in the first place?
>>>
>>>
>>>
>>> Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
>>> prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
>>> sure if that's possible, though.. In the future other functions that calls
>>> read_local_xlog_page() during the promotion may appear. Fixing the issue
>>> outside read_local_xlog_page() may cause those functions to get
>>> the same issue.
>>
>> I agree. We should fix the issue in read_local_xlog_page(). I have
>> attached two different patches which do so:
>> saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch.
>>
>> The former saves the value of the ThisTimeLineID before it gets changed
>> in read_local_xlog_page() and resets it after ThisTimeLineID has been
>> used later on in the code (by XLogReadDetermineTimeline()).
>>
>> The latter removes occurrences of ThisTimeLineID from
>> XLogReadDetermineTimeline() and introduces an argument currTLI to
>> XLogReadDetermineTimeline() to be used in its stead.
>
> read_local_xlog_page() works as a part of logical decoding and has
> responsibility to update ThisTimeLineID properly. As the comment in
> the function, it is the proper place to update ThisTimeLineID since we
> miss a timeline change if we check it earlier and the function uses
> the value just after. So we cannot change that behavior of the
> function. That is, neither of them doesn't seem to be the right fix.

Could you tell me what actual issue happens if read_local_xlog_page() resets
ThisTimeLineID at the end? Some replication slot-related functions that use
read_local_xlog_page() can be executed even during recovery. For example,
you mean that, when timeline swithes during recovery, those functions
behave incorrectly if ThisTimeLineID is reset?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-03-04 06:03:03 Re: Wired if-statement in gen_partprune_steps_internal
Previous Message Ajay Patel 2021-03-04 05:22:36 Re: How to provide Documentation Feedback