From: | Alexander Kukushkin <cyberdemn(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Subject: | Re: Infinite loop in XLogPageRead() on standby |
Date: | 2024-02-29 16:36:29 |
Message-ID: | CAFh8B=nPSERv7NyYHmjVXK4xK3va1XzU3-rhOswjgEZMWkV=RQ@mail.gmail.com |
Lists: | pgsql-hackers |
Hi Michael,
On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> Wow. Have you seen that in an actual production environment?
>
Yes, we see it regularly, and it is reproducible in test environments as
well.
> my $start_page = start_of_page($end_lsn);
> my $wal_file = write_wal($primary, $TLI, $start_page,
> "\x00" x $WAL_BLOCK_SIZE);
> # copy the file we just "hacked" to the archive
> copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record. Then abuse of WAL archiving to
> force the replay of the last record. That's kind of cool.
>
Right, at this point it is easier than causing an artificial crash on the
primary right after it has finished writing just one page.
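
To make the page arithmetic behind that test step concrete, here is a minimal sketch in C. It is not the test's Perl code: the helper names only mirror the quoted snippet, the 8192-byte block size is an assumption (the default WAL block size), and the "segment" is a plain in-memory buffer rather than a real WAL file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: default WAL block size; a real cluster may be built differently. */
#define WAL_BLOCK_SIZE 8192u

/* Hypothetical helper: round an LSN down to the start of its WAL page. */
static uint64_t start_of_page(uint64_t lsn)
{
    return lsn & ~((uint64_t) WAL_BLOCK_SIZE - 1);
}

/*
 * Hypothetical helper: zero the whole page containing `lsn` inside an
 * in-memory buffer that stands in for a WAL segment beginning at
 * `segment_start_lsn`. Returns 0 on success, -1 if the page does not
 * fit entirely inside the buffer.
 */
static int zero_page_at(unsigned char *segment, size_t segment_size,
                        uint64_t segment_start_lsn, uint64_t lsn)
{
    uint64_t page_lsn = start_of_page(lsn);
    uint64_t offset;

    if (page_lsn < segment_start_lsn)
        return -1;

    offset = page_lsn - segment_start_lsn;
    if (offset > segment_size || segment_size - offset < WAL_BLOCK_SIZE)
        return -1;

    /* Same effect as writing "\x00" x $WAL_BLOCK_SIZE at the page start. */
    memset(segment + offset, 0, WAL_BLOCK_SIZE);
    return 0;
}
```

Calling zero_page_at() with the end LSN of the last record wipes the page holding that record's tail, while the preceding page, which holds the start of the continuation record, stays intact.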
> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm. I suspect that you may be right on a TLI change when reading a
> page. There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader(). Perhaps you
> have an idea of patch to show your point?
>
Not yet, but hopefully I will get something done next week.
>
> Nit. In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.
>
Thanks, I'll look into it.
--
Regards,
--
Alexander Kukushkin