RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2025-06-04 10:42:36
Message-ID: OS0PR01MB5716163486D6131F5B77A47B946CA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 2, 2025 at 2:39 PM Amit Kapila wrote:
>
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > Attaching the V32 patch set which addressed comments in [1]~[5].
> >
>
> Review comments:
> ===============
> *
> +advance_conflict_slot_xmin(FullTransactionId new_xmin) {
> +FullTransactionId full_xmin; FullTransactionId next_full_xid;
> +
> + Assert(MyReplicationSlot);
> + Assert(FullTransactionIdIsValid(new_xmin));
> +
> + next_full_xid = ReadNextFullTransactionId();
> +
> + /*
> + * Compute FullTransactionId for the current xmin. This handles the
> + case
> + * where transaction ID wraparound has occurred.
> + */
> + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
> + MyReplicationSlot->data.xmin);
> +
> + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) return;
>
> The above code suggests that the launcher could compute a new xmin that is
> less than slot's xmin. At first, this looks odd to me, but IIUC, this can happen
> when the user toggles retain_conflict_info flag at some random time when the
> launcher is trying to compute the new xmin value for the slot. One of the
> possible combinations of steps for this race could be as follows:
>
> 1. The subscriber has two subscriptions, A and B. Subscription A has
> retain_conflict_info as true, and B has retain_conflict_info as false 2. Say the
> launcher calls get_subscription_list(), and worker A is already alive.
> 3. Assuming the apply worker will restart on changing retain_conflict_info, the
> user enables retain_conflict_info for subscription B.
> 4. The launcher processes the subscription B first in the first cycle, and starts
> worker B. Say, worker B gets 759 as candidate_xid.
> 5. The launcher creates the conflict detection slot, xmin = 759 6. Say a new txn
> happens, worker A gets 760 as candidate_xid and updates it to
> oldest_nonremovable_xid.
> 7. The launcher processes the subscription A in the first cycle, and the final
> xmin value is 760, because it only checks the oldest_nonremovable_xid from
> worker A. The launcher then updates the value to slot.xmin.
> 8. In the next cycle, the launcher finds that worker B has an older
> oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
> launher would have retreated the slot's xmin unless we had the above check in
> the quoted code.
>
> I think the above race is possible because the system lets the changed
> subscription values of a subscription take effect asynchronously by workers.
> The one more similar race condition handled by the patch is as follows:
>
> *
> ...
> + * It's necessary to use FullTransactionId here to mitigate potential
> + race
> + * conditions. Such scenarios might occur if the replication slot is
> + not
> + * yet created by the launcher while the apply worker has already
> + * initialized this field. During this period, a transaction ID
> + wraparound
> + * could falsely make this ID appear as if it originates from the
> + future
> + * w.r.t the transaction ID stored in the slot maintained by launcher.
> + See
> + * advance_conflict_slot_xmin.
> ...
> + FullTransactionId oldest_nonremovable_xid;
>
> This case can happen if the user disables and immediately enables the
> retain_conflict_info option. In this case, the launcher may drop the slot after
> noticing the disable. But the apply worker may not notice the disable and it only
> notices that the retain_conflict_info is still enabled, so it will keep maintaining
> oldest_nonremovable_xid when the slot is not created.
>
> It is okay to handle both the race conditions, but I am worried we may miss
> some such race conditions which could lead to difficult-to-find bugs. So, at
> least for the first version of this option (aka for patches 0001 to 0003), we can
> add a condition that allows us to change retain_conflict_info only on disabled
> subscriptions. This will simplify the patch.

Agreed.

> We can make a separate patch to
> allow changing retain_conflict_info option for enabled subscriptions. That will
> make it easier to evaluate such race conditions and the solutions more deeply.

I will prepare a separate patch soon.

Here is the V33 patch set which includes the following changes:

0001:
* Renaming and typo fixes based on Shveta's comments [1]
* Comment changes suggested by Amit [2]
* Changed oldest_nonremoable_xid from FullTransactionID to TransactionID.
* Code refactoring in drop_conflict_slot_if_exists()

0002:
* Documentation updates suggested by Amit [2]
* Code modifications to adapt to TransactionID oldest_nonremoable_xid

0003:
* Documentation improvements suggested by Shveta [3]
* Added restriction: disallow changing retain_conflict_info when sub
is enabled or worker is alive

0004:
* Simplified race condition handling due to the new restriction from 0003

0005:
* Code updates to accommodate both the TransactionID type for
oldest_nonremoable_xid and the new restriction from 0003

0006:
* New test case for the restriction introduced in 0003

0007:
No changes

[1] https://www.postgresql.org/message-id/CAJpy0uBSsRuVOeuo-i8R_aO0CMiORHTnEBZ9z-TDq941WqhyLA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1KUTHbgroBRNp8_dy3Lrc%2BetPm19O1RcyRcDBgCp7EFcg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAJpy0uAJUTmSx7fAE3gbnBUzp9ZDOgkLrP5gdoysKUGbvf7vGg%40mail.gmail.com

Best Regards,
Hou zj

Attachment Content-Type Size
v33-0007-Support-the-conflict-detection-for-update_delete.patch application/octet-stream 25.6 KB
v33-0001-Maintain-the-oldest-non-removable-transaction-ID.patch application/octet-stream 43.5 KB
v33-0002-Maintain-the-replication-slot-in-logical-launche.patch application/octet-stream 19.6 KB
v33-0003-Add-a-retain_conflict_info-option-to-subscriptio.patch application/octet-stream 91.5 KB
v33-0004-Introduce-a-new-GUC-max_conflict_retention_durat.patch application/octet-stream 28.2 KB
v33-0005-Re-create-the-replication-slot-if-the-conflict-r.patch application/octet-stream 9.8 KB
v33-0006-Add-a-tap-test-to-verify-the-management-of-the-n.patch application/octet-stream 7.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2025-06-04 10:45:32 Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Previous Message Ajin Cherian 2025-06-04 10:41:59 Re: Restrict publishing of partitioned table with a foreign table as partition