From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | RE: Conflict detection for update_deleted in logical replication |
Date: | 2025-06-04 10:42:36 |
Message-ID: | OS0PR01MB5716163486D6131F5B77A47B946CA@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Jun 2, 2025 at 2:39 PM Amit Kapila wrote:
>
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > Attaching the V32 patch set which addressed comments in [1]~[5].
> >
>
> Review comments:
> ===============
> *
> +advance_conflict_slot_xmin(FullTransactionId new_xmin) {
> +FullTransactionId full_xmin; FullTransactionId next_full_xid;
> +
> + Assert(MyReplicationSlot);
> + Assert(FullTransactionIdIsValid(new_xmin));
> +
> + next_full_xid = ReadNextFullTransactionId();
> +
> + /*
> + * Compute FullTransactionId for the current xmin. This handles the
> + case
> + * where transaction ID wraparound has occurred.
> + */
> + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
> + MyReplicationSlot->data.xmin);
> +
> + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) return;
>
> The above code suggests that the launcher could compute a new xmin that is
> less than slot's xmin. At first, this looks odd to me, but IIUC, this can happen
> when the user toggles retain_conflict_info flag at some random time when the
> launcher is trying to compute the new xmin value for the slot. One of the
> possible combinations of steps for this race could be as follows:
>
> 1. The subscriber has two subscriptions, A and B. Subscription A has
> retain_conflict_info as true, and B has retain_conflict_info as false 2. Say the
> launcher calls get_subscription_list(), and worker A is already alive.
> 3. Assuming the apply worker will restart on changing retain_conflict_info, the
> user enables retain_conflict_info for subscription B.
> 4. The launcher processes the subscription B first in the first cycle, and starts
> worker B. Say, worker B gets 759 as candidate_xid.
> 5. The launcher creates the conflict detection slot, xmin = 759 6. Say a new txn
> happens, worker A gets 760 as candidate_xid and updates it to
> oldest_nonremovable_xid.
> 7. The launcher processes the subscription A in the first cycle, and the final
> xmin value is 760, because it only checks the oldest_nonremovable_xid from
> worker A. The launcher then updates the value to slot.xmin.
> 8. In the next cycle, the launcher finds that worker B has an older
> oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
> launher would have retreated the slot's xmin unless we had the above check in
> the quoted code.
>
> I think the above race is possible because the system lets the changed
> subscription values of a subscription take effect asynchronously by workers.
> The one more similar race condition handled by the patch is as follows:
>
> *
> ...
> + * It's necessary to use FullTransactionId here to mitigate potential
> + race
> + * conditions. Such scenarios might occur if the replication slot is
> + not
> + * yet created by the launcher while the apply worker has already
> + * initialized this field. During this period, a transaction ID
> + wraparound
> + * could falsely make this ID appear as if it originates from the
> + future
> + * w.r.t the transaction ID stored in the slot maintained by launcher.
> + See
> + * advance_conflict_slot_xmin.
> ...
> + FullTransactionId oldest_nonremovable_xid;
>
> This case can happen if the user disables and immediately enables the
> retain_conflict_info option. In this case, the launcher may drop the slot after
> noticing the disable. But the apply worker may not notice the disable and it only
> notices that the retain_conflict_info is still enabled, so it will keep maintaining
> oldest_nonremovable_xid when the slot is not created.
>
> It is okay to handle both the race conditions, but I am worried we may miss
> some such race conditions which could lead to difficult-to-find bugs. So, at
> least for the first version of this option (aka for patches 0001 to 0003), we can
> add a condition that allows us to change retain_conflict_info only on disabled
> subscriptions. This will simplify the patch.
Agreed.
> We can make a separate patch to
> allow changing retain_conflict_info option for enabled subscriptions. That will
> make it easier to evaluate such race conditions and the solutions more deeply.
I will prepare a separate patch soon.
Here is the V33 patch set which includes the following changes:
0001:
* Renaming and typo fixes based on Shveta's comments [1]
* Comment changes suggested by Amit [2]
* Changed oldest_nonremoable_xid from FullTransactionID to TransactionID.
* Code refactoring in drop_conflict_slot_if_exists()
0002:
* Documentation updates suggested by Amit [2]
* Code modifications to adapt to TransactionID oldest_nonremoable_xid
0003:
* Documentation improvements suggested by Shveta [3]
* Added restriction: disallow changing retain_conflict_info when sub
is enabled or worker is alive
0004:
* Simplified race condition handling due to the new restriction from 0003
0005:
* Code updates to accommodate both the TransactionID type for
oldest_nonremoable_xid and the new restriction from 0003
0006:
* New test case for the restriction introduced in 0003
0007:
No changes
[1] https://www.postgresql.org/message-id/CAJpy0uBSsRuVOeuo-i8R_aO0CMiORHTnEBZ9z-TDq941WqhyLA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1KUTHbgroBRNp8_dy3Lrc%2BetPm19O1RcyRcDBgCp7EFcg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAJpy0uAJUTmSx7fAE3gbnBUzp9ZDOgkLrP5gdoysKUGbvf7vGg%40mail.gmail.com
Best Regards,
Hou zj
Attachment | Content-Type | Size |
---|---|---|
v33-0007-Support-the-conflict-detection-for-update_delete.patch | application/octet-stream | 25.6 KB |
v33-0001-Maintain-the-oldest-non-removable-transaction-ID.patch | application/octet-stream | 43.5 KB |
v33-0002-Maintain-the-replication-slot-in-logical-launche.patch | application/octet-stream | 19.6 KB |
v33-0003-Add-a-retain_conflict_info-option-to-subscriptio.patch | application/octet-stream | 91.5 KB |
v33-0004-Introduce-a-new-GUC-max_conflict_retention_durat.patch | application/octet-stream | 28.2 KB |
v33-0005-Re-create-the-replication-slot-if-the-conflict-r.patch | application/octet-stream | 9.8 KB |
v33-0006-Add-a-tap-test-to-verify-the-management-of-the-n.patch | application/octet-stream | 7.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | jian he | 2025-06-04 10:45:32 | Re: Add SPLIT PARTITION/MERGE PARTITIONS commands |
Previous Message | Ajin Cherian | 2025-06-04 10:41:59 | Re: Restrict publishing of partitioned table with a foreign table as partition |