Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline

From: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline
Date: 2026-06-12 00:57:05
Message-ID: CABPTF7WSpNOYu84fjGH2t56BctRzVD7t8WqhgvML2DRh8Vtfog@mail.gmail.com
Lists: pgsql-hackers

Hi Michael,

On Thu, Jun 11, 2026 at 9:15 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Jun 10, 2026 at 05:28:00PM +0000, Bertrand Drouvot wrote:
> > On Wed, Jun 10, 2026 at 04:36:14PM +0800, Xuneng Zhou wrote:
> >> The
> >> essential thing is just to ensure that the startup remains paused
> >> until decoding output is observed.
> >
> > Right, thanks for confirming. That's exactly what v2 is doing.
>
> I have looked at this thread, and my first impression was that this
> could be a data integrity issue while decoding changes due to the
> transient errors one could see across the promotion requests.
>
> But it's less severe than I thought initially: we have an availability
> problem here, down to v16, with a correct recovery possible once the
> promotion request has completed. That could be indeed surprising for
> users that have HA setups with standbys doing logical decoding. The
> SQL function path is less worrying to me, there are as far as I know
> few users of it compared to the "native" path with sync workers.
>
> read_local_xlog_page_guts() does not only impact SQL-callable logirep
> functions, even if it is the spot that should be hit most of the time
> (again, the RecoveryInProgress() vs promotion window is super narrow).
> At quick glance, things are:
> - walinspect.
> - Slot advance.
> - Slot creation (?), but it feels even narrower.

Yeah, it is used for two-phase commit as well. Its usage is broader
than I had noticed before; the repack worker also makes use of it.

> With two items dealt with on this thread for these two callback paths
> changed, moving on the part related to physical replication into its
> own thread would be better. This requires an entirely different
> analysis and a different lookup.

+1

> The backpatch of PG16 is straight-forward and adding
> GetWALInsertionTimeLineIfSet() down there does not look like an issue.
> Not having any tests in v16 feels sad, but that's life. It does not
> prevent addressing the availability issue on this branch.
>
> I'll go take it up from here.
> --

Thanks for dealing with this!

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
