Re: Fix slot synchronization with two_phase decoding enabled

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fix slot synchronization with two_phase decoding enabled
Date: 2025-05-10 11:29:20
Message-ID: CAA4eK1LH=AAAzGZp-_3vHhD6YQoEYsLv7sF4Mv-sQs=w4E59qw@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> >
> >
> > Yes, this is possible. Here is my theory as to how it can happen in the current
> > case. In the failed test, after the primary has prepared a transaction, the
> > transaction won't be replicated to the subscriber as two_phase was not
> > enabled for the slot. However, subsequent keepalive messages can send the
> > latest WAL location to the subscriber and get the confirmation of the same from
> > the subscriber without its origin being moved. Now, after we restart the apply
> > worker (due to disable/enable for a subscription), it will use the previous
> > origin_lsn to temporarily move back the confirmed flush LSN as explained in
> > one of the previous emails in another thread [1]. During this temporary
> > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at
> > and confirm_flush_lsn values, leading to the assertion failure. We see this
> > issue intermittently because it depends on the timing of slotsync worker's
> > request to fetch the slot's value.
>
> Based on this theory, I can reproduce the BF failure in the 040 tap-test on
> HEAD after applying the 0001 patch. This is achieved by using the injection
> point to stop the walsender from sending a keepalive before receiving the old
> origin position from the apply worker, ensuring the confirmed_flush
> consistently moves backward before slotsync.
>
> Additionally, I've reproduced the duplicate data issue on HEAD without slotsync
> using the attached script (after applying the injection point patch). This
> issue arises if we immediately disable the subscription after the
> confirm_flush_lsn moves backward, preventing the walsender from advancing the
> confirm_flush_lsn.
>

Script contents:
psql -d postgres -p $port_primary -c "create extension injection_points; SELECT injection_points_attach('process-replies', 'wait');"

psql -d postgres -p $port_subscriber -c "alter subscription sub set (two_phase = on); alter subscription sub enable;"

sleep 1

psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

I think what you said in the above paragraph is happening here. How
can the walsender move the confirm_flush_lsn backward while it is
waiting at the injection point? I think I am missing something here.
It would be good if you could add a few comments to your scripts.
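
One way to check whether the confirm_flush_lsn really moves backward
while the script runs would be to sample it repeatedly and compare
consecutive values. The following is only a minimal sketch and not part
of the posted script: the helper names lsn_to_int and detect_backward
are hypothetical, and the slot name 'sub' in the usage comment is an
assumption (the slot may be named differently in your setup).

```shell
# Convert an LSN in its textual "hi/lo" hexadecimal form into a single
# integer: (hi << 32) | lo.
lsn_to_int() {
    echo $(( (0x${1%%/*} << 32) | 0x${1##*/} ))
}

# Read one LSN per line on stdin and report every sample that is lower
# than the one before it. Exit status is 0 if a backward move was seen,
# 1 otherwise. The body runs in a subshell so no variables leak out.
detect_backward() (
    prev=""
    status=1
    while IFS= read -r cur; do
        [ -n "$cur" ] || continue
        if [ -n "$prev" ] &&
           [ "$(lsn_to_int "$cur")" -lt "$(lsn_to_int "$prev")" ]; then
            echo "backward: $prev -> $cur"
            status=0
        fi
        prev=$cur
    done
    exit "$status"
)

# Hypothetical usage (slot name 'sub' is an assumption): collect samples
# from the primary into a file while the script runs, then check them.
#   psql -At -d postgres -p $port_primary -c \
#     "SELECT confirmed_flush_lsn FROM pg_replication_slots
#      WHERE slot_name = 'sub'" >> samples.txt
#   detect_backward < samples.txt
```

Sampling like this can miss a short-lived backward move between two
polls, so an empty result would not prove that it never happened.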

--
With Regards,
Amit Kapila.
