Re: PITR promote bug: Checkpointer writes to older timeline

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: soumyadeep2007(at)gmail(dot)com
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org, michael(at)paquier(dot)xyz, jyih(at)vmware(dot)com, kyeap(at)vmware(dot)com
Subject: Re: PITR promote bug: Checkpointer writes to older timeline
Date: 2021-03-04 01:28:31
Message-ID: 20210304.102831.755904679464661262.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Wed, 3 Mar 2021 14:56:25 -0800, Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com> wrote in
> On 2021/03/03 17:46, Heikki Linnakangas wrote:
>
> > I think it should be reset even earlier, inside XlogReadTwoPhaseData()
> > probably. With your patch, doesn't the LogStandbySnapshot() call just
> > above where you're ressetting ThisTimeLineID also write a WAL record
> > with incorrect timeline?
>
> Agreed.

Right.

> On Wed, Mar 3, 2021 at 1:04 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
> > > Even better, can we avoid setting ThisTimeLineID in XlogReadTwoPhaseData() in the first place?
> >
> >
> >
> > Or isn't it better to reset ThisTimeLineID in read_local_xlog_page(), i.e.,
> > prevent read_local_xlog_page() from changing ThisTimeLineID? I'm not
> > sure if that's possible, though.. In the future other functions that calls
> > read_local_xlog_page() during the promotion may appear. Fixing the issue
> > outside read_local_xlog_page() may cause those functions to get
> > the same issue.
>
> I agree. We should fix the issue in read_local_xlog_page(). I have
> attached two different patches which do so:
> saved_ThisTimeLineID.patch and pass_ThisTimeLineID.patch.
>
> The former saves the value of the ThisTimeLineID before it gets changed
> in read_local_xlog_page() and resets it after ThisTimeLineID has been
> used later on in the code (by XLogReadDetermineTimeline()).
>
> The latter removes occurrences of ThisTimeLineID from
> XLogReadDetermineTimeline() and introduces an argument currTLI to
> XLogReadDetermineTimeline() to be used in its stead.

read_local_xlog_page() works as a part of logical decoding and has
responsibility to update ThisTimeLineID properly. As the comment in
the function, it is the proper place to update ThisTimeLineID since we
miss a timeline change if we check it earlier and the function uses
the value just after. So we cannot change that behavior of the
function. That is, neither of them doesn't seem to be the right fix.

The confusion here is that there's two ThisTimeLineID's here. The
previous TLI for read and the next TLI to write. Most part of the
function is needed to read a 2pc recrod so the ways we can take here
is:

1. Somehow tell the function not to update ThisTimeLineID in specific
cases. This can be done by xlogreader private data but this doesn't
seem reasonable.

2. Restore ThisTimeLineID after calling XLogReadRecord() in the
*caller* side. This is what came up to me first but I don't like
this, too, but I don't find better fix. way.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
dont_change_thistimelineid.patch text/x-patch 911 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-03-04 01:35:19 n_mod_since_analyze isn't reset at table truncation
Previous Message Andy Fan 2021-03-04 01:05:29 Re: Make Append Cost aware of some run time partition prune case