Re: Infinite loop in XLogPageRead() on standby

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-02-29 16:36:29
Message-ID: CAFh8B=nPSERv7NyYHmjVXK4xK3va1XzU3-rhOswjgEZMWkV=RQ@mail.gmail.com
Lists: pgsql-hackers

Hi Michael,

On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

>
> Wow. Have you seen that in an actual production environment?
>

Yes, we see it regularly, and it is reproducible in test environments as
well.

> my $start_page = start_of_page($end_lsn);
> my $wal_file = write_wal($primary, $TLI, $start_page,
> "\x00" x $WAL_BLOCK_SIZE);
> # copy the file we just "hacked" to the archive
> copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record. Then abuse of WAL archiving to
> force the replay of the last record. That's kind of cool.
>

Right, at this point it is easier than causing an artificial crash on the
primary right after it has finished writing just one page.
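
For anyone following along, the page arithmetic behind the quoted test can
be sketched roughly as below. This is only an illustration in C, not the
Perl test itself: the constants assume the default 8kB XLOG_BLCKSZ and 16MB
wal_segment_size, start_of_page() here is my guess at what the test helper
of the same name computes, and offset_in_segment() and zero_page_at() are
hypothetical helpers that work on an in-memory copy of a segment rather
than on a real WAL file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed defaults; a real cluster may be built or initialized differently. */
#define WAL_BLOCK_SIZE   8192u                 /* default XLOG_BLCKSZ */
#define WAL_SEGMENT_SIZE (16u * 1024u * 1024u) /* default wal_segment_size */

/* Round an LSN (a byte position in the WAL stream) down to its page start. */
static uint64_t
start_of_page(uint64_t lsn)
{
    return lsn - (lsn % WAL_BLOCK_SIZE);
}

/* Byte offset of an LSN inside the segment file that contains it. */
static uint32_t
offset_in_segment(uint64_t lsn)
{
    return (uint32_t) (lsn % WAL_SEGMENT_SIZE);
}

/*
 * Zero the whole page containing lsn in an in-memory copy of its segment.
 * Returns the offset of the zeroed page within the segment.
 */
static uint32_t
zero_page_at(unsigned char *segment, size_t segment_len, uint64_t lsn)
{
    uint32_t off = offset_in_segment(start_of_page(lsn));

    assert(segment != NULL);
    assert((size_t) off + WAL_BLOCK_SIZE <= segment_len);
    memset(segment + off, 0, WAL_BLOCK_SIZE);
    return off;
}
```

A zeroed page has no valid page header, so the reader sees an invalid page
at exactly the position where the continuation of the previous record is
expected.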

> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm. I suspect that you may be right on a TLI change when reading a
> page. There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader(). Perhaps you
> have an idea of patch to show your point?
>

Not yet, but hopefully I will get something done next week.

>
> Nit. In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.
>

Thanks, I'll look into it.

--
Regards,
--
Alexander Kukushkin
