Infinite loop in XLogPageRead() on standby

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Infinite loop in XLogPageRead() on standby
Date: 2024-02-28 10:19:41
Message-ID: CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

I spent some time debugging an issue with standby not being able to
continue streaming after failover.

The problem manifests itself by following messages in the log:
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to "port=58669
host=/tmp/dn20WVmNqF"
LOG: restored log file "000000010000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000010000000000000003,
LSN 0/301A000, offset 106496
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3019158.
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "00000002.history" from archive
LOG: new target timeline is 2
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
FATAL: terminating walreceiver process due to administrator command
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496

The problem happens when standbys received only the first part of the WAL
record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If in addition to that someone will
call pg_switch_wal(), then there are chances that SWITCH record will also
fit to the page where the discarded part was settling, As a result the
other standby (that wasn't promoted) will infinitely try making attempts to
decode WAL record span on multiple pages by reading the next page, which is
filled with zero bytes. And, this next page will never be written, because
the new primary will be writing to the new WAL file after pg_switch_wal().

Restart of the stuck standby fixes the problem, because it will be first
reading the history file and therefore will never read the incomplete WAL
file from the old timeline. That is, all major versions starting from v13
are impacted (including the master branch), because they allow changing of
primary_conninfo GUC with reload.

Please find attached the TAP test that reproduces the problem.

To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
timeline while trying to read a page and if this page is invalid.

--
Regards,
--
Alexander Kukushkin

Attachment Content-Type Size
040_no_contrecord_switch.pl application/x-perl 6.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2024-02-28 10:27:41 Re: POC, WIP: OR-clause support for indexes
Previous Message jian he 2024-02-28 10:07:21 Re: POC, WIP: OR-clause support for indexes