Re: Fix slot synchronization with two_phase decoding enabled

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Fix slot synchronization with two_phase decoding enabled
Date: 2025-05-12 08:40:56
Message-ID: CAJpy0uDWtcaV1BGVqUdhBx42-Vs5Wm47wvFuXvNMHAh8=sE1Jg@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 10, 2025 at 4:59 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> > >
> > >
> > > Yes, this is possible. Here is my theory as to how it can happen in the current
> > > case. In the failed test, after the primary has prepared a transaction, the
> > > transaction won't be replicated to the subscriber as two_phase was not
> > > enabled for the slot. However, subsequent keepalive messages can send the
> > > latest WAL location to the subscriber and get the confirmation of the same from
> > > the subscriber without its origin being moved. Now, after we restart the apply
> > > worker (due to disable/enable for a subscription), it will use the previous
> > > origin_lsn to temporarily move back the confirmed flush LSN as explained in
> > > one of the previous emails in another thread [1]. During this temporary
> > > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at
> > > and confirm_flush_lsn values, leading to the assertion failure. We see this
> > > issue intermittently because it depends on the timing of slotsync worker's
> > > request to fetch the slot's value.
> >
> > Based on this theory, I can reproduce the BF failure in the 040 tap-test on
> > HEAD after applying the 0001 patch. This is achieved by using the injection
> > point to stop the walsender from sending a keepalive before receiving the old
> > origin position from the apply worker, ensuring the confirmed_flush
> > consistently moves backward before slotsync.
> >
> > Additionally, I've reproduced the duplicate data issue on HEAD without slotsync
> > using the attached script (after applying the injection point patch). This
> > issue arises if we immediately disable the subscription after the
> > confirm_flush_lsn moves backward, preventing the walsender from advancing the
> > confirm_flush_lsn.
> >
>
> Script contents:
> psql -d postgres -p $port_primary -c "create extension injection_points; SELECT injection_points_attach('process-replies', 'wait');"
>
> psql -d postgres -p $port_subscriber -c "alter subscription sub set (two_phase = on); alter subscription sub enable;"
>
> sleep 1
>
> psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"
>
> I think what you said in the above paragraph is happening here. How
> can the walsender move the confirm_flush_lsn backward when it is
> waiting due to the injection point? I think I am missing something
> here. It would be good if you could add a few comments to your
> scripts.
>

Please see my comments in the attached (updated) script. The test case
to reproduce the issue on HEAD is unchanged; I have only added comments
to elaborate on the flow that moves confirmed_flush backward.
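
To check by hand whether confirmed_flush really went backward between
two points in the script, a small helper along the following lines
might be useful. This is only a sketch and is not part of the attached
script: the function names lsn_to_int and lsn_moved_backward are made
up for illustration, and the commented-out query assumes the slot is
named 'sub' (the default when the subscription is named sub).

```shell
# Convert a PostgreSQL LSN of the form "HI/LO" (two hex numbers) into a
# single decimal integer so that two LSNs can be compared numerically.
# Hypothetical helper; intended for ordinary LSNs whose high part is small.
lsn_to_int() {
    hi=$(printf '%d' "0x${1%%/*}")   # part before the slash
    lo=$(printf '%d' "0x${1##*/}")   # part after the slash
    echo $(( hi * 4294967296 + lo )) # hi * 2^32 + lo
}

# Succeed (exit status 0) only when the second LSN is strictly behind
# the first one, i.e. the position moved backward.
lsn_moved_backward() {
    [ "$(lsn_to_int "$2")" -lt "$(lsn_to_int "$1")" ]
}

# Possible use against the primary (assumption: slot name is 'sub'):
#   before=$(psql -d postgres -p $port_primary -Atc \
#     "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'sub'")
#   ... enable and then disable the subscription ...
#   after=$(psql -d postgres -p $port_primary -Atc \
#     "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'sub'")
#   lsn_moved_backward "$before" "$after" && echo "confirmed_flush moved backward"
```

Comparing the values as integers avoids the pitfalls of comparing the
textual LSN form, where "0/A000000" would sort before "0/9000000" only
by accident of string length and case.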

thanks
Shveta

Attachment: reproduce_without_slotsync_HEAD.sh (text/x-sh, 6.2 KB)
