Re: [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly
Date: 2016-11-12 12:31:35
Message-ID: CAA4eK1LRdr+7Z2H3+y8+o9uY_Tqs2VUsb9rJOAxeNxXE3wf-hQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 10, 2016 at 10:43 AM, Tsunakawa, Takayuki
<tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com> wrote:
> From: pgsql-hackers-owner(at)postgresql(dot)org
>> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Robert Haas
>> OK. I agree that's a problem. However, your patch adds zero new comment
>> text while removing some existing comments, so I can't easily tell how it
>> solves that problem or whether it does so correctly. Even if I were smart
>> enough to figure it out, I wouldn't want to rely on the next person also
>> being that smart. This is obviously a subtle problem in tricky code, so
>> a clear explanation of the fix seems like a very good idea.
>
> The comment describes what the code is trying to achieve. Actually, I just imitated the code and comment of later major releases. The only difference between later releases and my patch (for 9.2) is whether the state is stored in XLogReaderStruct or in global variables. Below is the comment from 9.6, where the second paragraph describes what the two nested if conditions mean. The removed comment lines are the ones that became irrelevant; they are also not present in later major releases.
>
> /*
> * Since child timelines are always assigned a TLI greater than their
> * immediate parent's TLI, we should never see TLI go backwards across
> * successive pages of a consistent WAL sequence.
> *
> * Sometimes we re-read a segment that's already been (partially) read. So
> * we only verify TLIs for pages that are later than the last remembered
> * LSN.
> */
>
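
[Illustration, not part of the original message] The check that the quoted comment describes can be sketched roughly as below. This is a minimal standalone sketch: the struct PageCheckState and the function check_page_tli are hypothetical names invented here, and only the field names latestPagePtr and latestPageTLI follow the discussion. It is not the code from any PostgreSQL branch or from the proposed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* stand-in for a WAL position */
typedef uint32_t TimeLineID;   /* stand-in for a timeline ID */

/* Hypothetical, trimmed-down state: only the two remembered values. */
typedef struct PageCheckState
{
    XLogRecPtr latestPagePtr;  /* position of the last page header checked */
    TimeLineID latestPageTLI;  /* TLI found in that page header */
} PageCheckState;

/*
 * Return false if the page at recptr carries a TLI lower than the
 * remembered one. The comparison is made only for pages beyond the last
 * remembered position, so re-reading an earlier page is not flagged.
 * On success the remembered position and TLI are overwritten.
 */
bool
check_page_tli(PageCheckState *state, XLogRecPtr recptr, TimeLineID page_tli)
{
    assert(state != NULL);

    if (recptr > state->latestPagePtr)
    {
        if (page_tli < state->latestPageTLI)
            return false;      /* TLI went backwards on a later page */
    }

    state->latestPagePtr = recptr;
    state->latestPageTLI = page_tli;
    return true;
}
```

Because the outer condition depends on what was last stored in latestPagePtr, the value passed as recptr determines which later pages get checked at all.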

I think the change you are referring to was made as part of
commit 7fcbf6a405ffc12a4546a25b98592ee6733783fc. That commit does not
mention any such bug fix; however, it is quite possible that the change
fixed the problem you have reported. It is not clear that we can copy
that change directly, and it seems to me that the copied change is also
incomplete. It looks like the code in 9.3 and later versions uses recptr
as the target segment location (targetSegmentPtr), whereas 9.2 uses
recptr as the beginning of the segment (readOff = 0;). If that
understanding is right, then latestPagePtr will be set to different
values in 9.2 and in the 9.3-and-later code.
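
[Illustration, not part of the original message] To make concrete why the choice of pointer matters for the remembered value, here is a small arithmetic sketch. It assumes the default 16 MB WAL segment size and 8 kB WAL page size, and all names in it (SKETCH_SEG_SIZE, SKETCH_BLCKSZ, the sketch_* functions) are hypothetical. It does not reproduce the 9.2 or the 9.3 code; it only shows that a segment-start pointer and a page-start pointer derived from the same LSN usually differ.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* stand-in for a WAL position */

/* Assumed defaults; real builds may be configured differently. */
#define SKETCH_SEG_SIZE ((XLogRecPtr) 16 * 1024 * 1024)
#define SKETCH_BLCKSZ   ((XLogRecPtr) 8192)

/* Start of the WAL page that contains lsn. */
XLogRecPtr
sketch_page_start(XLogRecPtr lsn)
{
    return lsn - (lsn % SKETCH_BLCKSZ);
}

/* Start of the WAL segment that contains lsn. */
XLogRecPtr
sketch_segment_start(XLogRecPtr lsn)
{
    return lsn - (lsn % SKETCH_SEG_SIZE);
}

/* Byte offset of the containing page within its segment. */
XLogRecPtr
sketch_page_offset_in_segment(XLogRecPtr lsn)
{
    XLogRecPtr page = sketch_page_start(lsn);
    XLogRecPtr seg = sketch_segment_start(lsn);

    assert(page >= seg);       /* a page never starts before its segment */
    return page - seg;
}
```

For an LSN of 0x3012345 under these assumptions, the segment starts at 0x3000000 and the page starts at 0x3012000, so a check keyed on one value would compare against a different position than a check keyed on the other.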

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
