Re: Infinite loop in XLogPageRead() on standby

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: cyberdemn(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-02-29 07:18:14
Message-ID: 20240229.161814.1585171803334193240.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> > I spent some time debugging an issue with standby not being able to
> > continue streaming after failover.
> >
> > The problem happens when standbys received only the first part of the WAL
> > record that spans multiple pages.
> > In this case the promoted standby discards the first part of the WAL record
> > and writes END_OF_RECOVERY instead. If in addition to that someone will
> > call pg_switch_wal(), then there are chances that SWITCH record will also
> > fit to the page where the discarded part was settling. As a result the
> > other standby (that wasn't promoted) will infinitely try making attempts to
> > decode WAL record span on multiple pages by reading the next page, which is
> > filled with zero bytes. And, this next page will never be written, because
> > the new primary will be writing to the new WAL file after pg_switch_wal().

First, it's important to note that we do not guarantee that an async
standby can always switch its replication connection to the old
primary or another sibling standby, because replication lag varies
among standbys. pg_rewind is required to reconcile such discrepancies.

I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switch, replication is expected to continue
from the beginning of the last replayed segment. But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming. A similar problem
to the issue here was seen at segment boundaries, before we introduced
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which prevents overwriting
a WAL file that is already archived. However, in this case, the second
standby won't see the broken record because it cannot be in a
non-partial segment in the archive, and the new primary streams
END_OF_RECOVERY instead of the broken record.
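
For readers who want to check which of these cases a given WAL page
falls into, here is a minimal sketch that decodes the short page header
and reports the flags discussed above. It is an illustration, not code
from the PostgreSQL tree: describe_page() is a hypothetical helper, the
field layout and flag values are taken from xlog_internal.h as of
PostgreSQL 15 and later, and a little-endian platform is assumed.

```python
import struct

# Flag bits of xlp_info, as defined in xlog_internal.h (PostgreSQL 15+).
XLP_FIRST_IS_CONTRECORD = 0x0001
XLP_LONG_HEADER = 0x0002
XLP_BKP_REMOVABLE = 0x0004
XLP_FIRST_IS_OVERWRITE_CONTRECORD = 0x0008

# Short page header fields, in order: xlp_magic (uint16), xlp_info (uint16),
# xlp_tli (uint32), xlp_pageaddr (uint64), xlp_rem_len (uint32).
# Assumption: little-endian byte order.
_SHORT_HEADER = struct.Struct("<HHIQI")


def describe_page(page: bytes) -> dict:
    """Hypothetical helper: summarize the header of one WAL page."""
    magic, info, tli, pageaddr, rem_len = _SHORT_HEADER.unpack_from(page, 0)
    return {
        # A page that was never written has an all-zero header.
        "zero_filled": not any(page[:_SHORT_HEADER.size]),
        "timeline": tli,
        "pageaddr": pageaddr,
        # Bytes of a record continued from the previous page, if any.
        "rem_len": rem_len,
        "first_is_contrecord": bool(info & XLP_FIRST_IS_CONTRECORD),
        "overwrite_contrecord": bool(info & XLP_FIRST_IS_OVERWRITE_CONTRECORD),
    }
```

Feeding it the first bytes of each page of a segment would show whether
the page following the discarded record is zero-filled or carries the
overwrite flag.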

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
