From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Fix slot synchronization with two_phase decoding enabled |
Date: | 2025-05-10 11:29:20 |
Message-ID: | CAA4eK1LH=AAAzGZp-_3vHhD6YQoEYsLv7sF4Mv-sQs=w4E59qw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> >
> >
> > Yes, this is possible. Here is my theory as to how it can happen in the current
> > case. In the failed test, after the primary has prepared a transaction, the
> > transaction won't be replicated to the subscriber as two_phase was not
> > enabled for the slot. However, subsequent keepalive messages can send the
> > latest WAL location to the subscriber and get the confirmation of the same from
> > the subscriber without its origin being moved. Now, after we restart the apply
> > worker (due to disable/enable for a subscription), it will use the previous
> > origin_lsn to temporarily move back the confirmed flush LSN as explained in
> > one of the previous emails in another thread [1]. During this temporary
> > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at
> > and confirm_flush_lsn values, leading to the assertion failure. We see this
> > issue intermittently because it depends on the timing of slotsync worker's
> > request to fetch the slot's value.
>
> Based on this theory, I can reproduce the BF failure in the 040 tap-test on
> HEAD after applying the 0001 patch. This is achieved by using the injection
> point to stop the walsender from sending a keepalive before receiving the old
> origin position from the apply worker, ensuring the confirmed_flush
> consistently moves backward before slotsync.
>
> Additionally, I've reproduced the duplicate data issue on HEAD without slotsync
> using the attached script (after applying the injection point patch). This
> issue arises if we immediately disable the subscription after the
> confirm_flush_lsn moves backward, preventing the walsender from advancing the
> confirm_flush_lsn.
>
Script contents:

psql -d postgres -p $port_primary -c "create extension injection_points; SELECT injection_points_attach('process-replies', 'wait');"
psql -d postgres -p $port_subscriber -c "alter subscription sub set (two_phase = on); alter subscription sub enable;"
sleep 1
psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

I think what you said in the above paragraph is happening here. How
can the walsender move the confirm_flush_lsn backward while it is
waiting at the injection point? I think I am missing something
here. It would be good if you could add a few comments to your
script.
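To make the sequence I have in mind concrete, here is a toy model of the theory quoted above. It is only a sketch using plain shell variables: it does not touch PostgreSQL, the function and variable names are hypothetical, and it assumes the slot accepts the reported position without any monotonicity check.

```shell
#!/bin/sh
# Toy model only: plain shell variables, no PostgreSQL involved.
# All names below are hypothetical and exist only for illustration.

confirmed_flush=100   # slot's confirmed flush position on the primary
origin_lsn=100        # subscriber's replication origin position

# Keepalive exchange: the subscriber confirms the latest WAL location,
# but its origin does not move because nothing was applied.
keepalive_reply() {
    confirmed_flush=$1
}

# Apply worker restart: the worker reports its old origin position and
# (in this model) the slot accepts it without a monotonicity check.
restart_apply_worker() {
    confirmed_flush=$origin_lsn
}

keepalive_reply 200
before=$confirmed_flush
restart_apply_worker
after=$confirmed_flush

echo "confirmed_flush before restart: $before, after restart: $after"
```

In this model the backward movement happens only when the restarted worker's report is processed, which is why it is unclear to me at which step the injection point wait fits in.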
--
With Regards,
Amit Kapila.