Re: Infinite loop in XLogPageRead() on standby

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: cyberdemn(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-03-01 01:29:12
Message-ID: 20240301.102912.2136692012665276484.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
> > wrote:
> >> In the first place, it's important to note that we do not guarantee
> >> that an async standby can always switch its replication connection to
> >> the old primary or another sibling standby. This is due to the
> >> variations in replication lag among standbys. pg_rewind is required to
> >> adjust such discrepancies.
> >
> > Sure, I know. But in this case the async standby received and flushed
> > absolutely the same amount of WAL as the promoted one.
>
> Ugh. If it can happen transparently to the user without the user
> knowing directly about it, that does not sound good to me. I did not
> look very closely at monitoring tools available out there, but if both
> standbys flushed the same WAL locations a rewind should not be
> required. It is not something that monitoring tools would be able to
> detect because they just look at LSNs.
>
> >> In the repro script, the replication connection of the second standby
> >> is switched from the old primary to the first standby after its
> >> promotion. After the switching, replication is expected to continue
> >> from the beginning of the last replayed segment.
> >
> > Well, maybe, but apparently the standby is busy trying to decode a record
> > that spans multiple pages, and it is just infinitely waiting for the next
> > page to arrive. Also, the restart "fixes" the problem, because indeed it is
> > reading the file from the beginning.
>
> What happens if the continuation record spans across multiple
> segment-file boundaries in this case? We would go back to the
> beginning of the segment where the record spanning multiple segments
> began, right? (I may recall this part of contrecords incorrectly,
> feel free to correct me if necessary.)

After reading this, I came up with the possibility that walreceiver
reconnects more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If walreceiver disconnects after a call
to WaitForWAL...(), and then somehow reestablishes the connection
before the next call, the function does not notice the disconnection
and wrongly returns XLREAD_SUCCESS. If this assumption is correct, the
proper fix might be to return XLREAD_FAIL when a reconnection has
happened since the last call to WaitForWAL...().

> >> But with the script,
> >> the second standby copies the intentionally broken file, which differs
> >> from the data that should be received via streaming.
> >
> > As I already said, this is a simple way to emulate the primary crash while
> > standbys receiving WAL.
> > It could easily happen that the record spans on multiple pages is not fully
> > received and flushed.
>
> I think that's OK to do so at test level to force a test in the
> backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
> proved that this can be designed to be cheap and stable across the
> buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue has happened, but if my assumption above is correct, the test
doesn't reproduce the real issue.

> For anything like that, the result had better have solid test
> coverage, where perhaps we'd better refactor some of the routines of
> 039_end_of_wal.pl into a module to use them here, rather than
> duplicate the code. The other test has a few assumptions with the
> calculation of page boundaries, and I'd rather not duplicate that
> across the tree.

I agree on that point.

> Seeing that Alexander is a maintainer of Patroni, which is very
> probably used by his employer across a large set of PostgreSQL
> instances, if he says that he's seen that in the field, that's good
> enough for me to respond to the problem, especially if reconnecting a
> standby to a promoted node where both flushed the same LSN. Now the
> level of response also depends on the invasiveness of the change, and
> need a very careful evaluation here.

I don't mean to say that we should respond to this "issue" with DNF at
all. I simply wanted to make the real underlying problem clear.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
