Re: Infinite loop in XLogPageRead() on standby

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-02-29 05:05:15
Message-ID: ZeAQi1UMyyWkspiT@paquier.xyz
Lists: pgsql-hackers

On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> I spent some time debugging an issue with a standby not being able to
> continue streaming after failover.
>
> The problem happens when a standby has received only the first part of a
> WAL record that spans multiple pages.
> In this case the promoted standby discards the first part of the WAL record
> and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
> pg_switch_wal(), then there is a chance that the SWITCH record will also
> fit on the page where the discarded part was sitting. As a result, the
> other standby (the one that wasn't promoted) will endlessly attempt to
> decode the WAL record spanning multiple pages by reading the next page,
> which is filled with zero bytes. And this next page will never be written,
> because the new primary will be writing to a new WAL file after
> pg_switch_wal().

Wow. Have you seen that in an actual production environment?

I have just tried your TAP test and can see it looping on a single
record, as you mentioned:
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in
WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in
WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496

> A restart of the stuck standby fixes the problem, because it will first
> read the history file and therefore will never read the incomplete WAL
> file from the old timeline. That is, all major versions starting from v13
> are impacted (including the master branch), because they allow changing
> the primary_conninfo GUC with a reload.

Still, that's not nice at a large scale, because you would not know
about the problem until your monitoring tools raise alarms once some
nodes in your cluster start lagging behind.

> Please find attached the TAP test that reproduces the problem.

my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
	"\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);

So you are emulating a failure by filling with zeros the second page on
which the last emit_message() generated a record, while the page before
that one includes the continuation record. Then you abuse WAL archiving
to force the replay of the last record. That's kind of cool.
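
For anybody reading along without opening the attachment: start_of_page()
and write_wal() are helpers defined in the attached test, so their bodies
are not shown here. Judging only by how they are used above, my guess is
that start_of_page() rounds an LSN down to its WAL page boundary, roughly
like this (a hypothetical reconstruction, not the code from the patch):

# Hypothetical sketch, assuming $lsn is already an integer LSN and
# $WAL_BLOCK_SIZE matches the cluster's XLOG_BLCKSZ (8192 by default):
# mask away the offset within the page to get the page's start.
sub start_of_page
{
	my ($lsn) = @_;
	return $lsn & ~($WAL_BLOCK_SIZE - 1);
}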

> To be honest, I don't know yet how to fix it nicely. I am thinking about
> returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> timeline while trying to read a page and if this page is invalid.

Hmm. I suspect that you may be right about a TLI change happening while
reading a page. There are a bunch of side cases with continuation records
and header validation around XLogReaderValidatePageHeader(). Perhaps you
have an idea of a patch to show your point?

Nit: in your test, it seems to me that you should not call
set_standby_mode and enable_restoring directly, but just rely on
has_restoring with the standby option included.
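
Something like this, roughly (untested, and the node and backup variable
names are whatever your test already uses):

# Untested sketch: let init_from_backup() configure restore_command and
# create standby.signal, instead of calling set_standby_mode() and
# enable_restoring() by hand.
$standby->init_from_backup($primary, $backup_name,
	has_restoring => 1, standby => 1);
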
--
Michael
