From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Subject: | RE: Conflict detection for update_deleted in logical replication |
Date: | 2025-07-07 02:12:46 |
Message-ID: | OS0PR01MB5716A67EE9A4FB4EF361B5BC944FA@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Jul 6, 2025 at 10:51 PM Masahiko Sawada wrote:
>
> On Fri, Jul 4, 2025 at 8:18 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Wed, Jul 2, 2025 at 3:28 PM Hou, Zhijie wrote:
> > > Kindly use the latest patch set for performance testing.
> >
> > During testing, we observed a limitation in cascading logical
> > replication setups, such as (A -> B -> C). When retain_conflict_info
> > is enabled on Node C, it may not retain information necessary for
> > conflict detection when applying changes originally replicated from
> > Node A. This happens because Node C only waits for locally originated
> > changes on Node B to be applied before advancing the non-removable
> transaction ID.
> >
> > For example, Consider a logical replication setup as mentioned above : A -> B
> -> C.
> > - All three nodes have a table t1 with two tuples (1,1) (2,2).
> > - Node B subscribed to all changes of t1 from Node A
> > - Node-C subscribed to all changes from Node B.
> > - Subscriptions use the default origin=ANY, as this is not a bidirectional
> > setup.
> >
> > Now, consider two concurrent operations:
> > - @9:00 Node A - UPDATE (1,1) -> (1,11)
> >
> > - @9:02 Node C - DELETE (1,1)
> >
> > Assume a slight delay at Node B before it applies the update from Node A.
> >
> > @9:03 Node C - advances the non-removable XID because it sees no
> > concurrent transactions from Node B. It is unaware of Node A’s concurrent
> update.
> >
> > @9:04 Node B - receives Node A's UPDATE and applies (1,1) -> (1,11)
> > t1 has tuples : (1,11), (2,2)
> >
> > @9:05 Node C - receives the UPDATE (1,1) -> (1,11)
> > - As conflict slot’s xmin is advanced, the deleted tuple may already
> have
> > been removed.
> > - Conflict resolution fails to detect update_deleted and instead raises
> > update_missing.
> >
> > Note that, as per decoding logic Node C sees the commit timestamp of
> > the update as 9:00 (origin commit_ts from Node A), not 9:04 (commit
> > time on Node B). In this case, since the UPDATE's timestamp is earlier
> > than the DELETE, Node C should ideally detect an update_deleted
> > conflict. However, it cannot, because it no longer retains the deleted tuple.
> >
> > Even if Node C attempts to retrieve the latest WAL position from Node
> > A, Node C doesn't maintain any LSN which we could use to compare with it.
> >
> > This scenario is similar to another restriction in the patch where
> > retain_conflict_info is not supported if the publisher is also a
> > physical standby, as the required transaction information from the
> > original primary is unavailable. Moreover, this limitation is relevant
> > only when the subscription origin option is set to ANY, as only in
> > that case changes from other origins can be replicated. Since
> > retain_conflict_info is primarily useful for conflict detection in
> > bidirectional clusters where the origin option is set to NONE, this limitation
> appears acceptable.
> >
> > Given these findings, to help users avoid unintended configurations,
> > we plan to issue a warning in scenarios where replicated changes may
> > include origins other than the direct publisher, similar to the
> > existing checks in the
> > check_publications_origin() function.
> >
> > Here is the latest patch that implements the warning and documents
> > this case. Only 0001 is modified for this.
> >
> > A big thanks to Nisha for invaluable assistance in identifying this
> > case and preparing the analysis for it.
>
> I'm still reviewing the 0001 patch but let me share some comments and
> questions I have so far:
Thanks for the comments!
>
> ---
> It seems there is no place where we describe the overall idea of reliably
> detecting update_deleted conflicts. The comment atop
> maybe_advance_nonremovable_xid() describes why the implemented
> algorithm works for that purpose but doesn't how it is implemented, for
> example the relationship with pg_conflict_detection slot. I'm not sure the long
> comment atop maybe_advance_nonremovable_xid() is the right place as it
> seems to be a description beyond explaining
> maybe_advance_nonremovable_xid() function. Probably we can move that
> comment and explain the overall idea somewhere for example atop worker.c or
> in README.
I think it makes sense to explain it atop of worker.c, will do.
>
> ---
> The new parameter name "retain_conflict_info" sounds to me like we keep the
> conflict information somewhere that has expired at some time such as how
> many times insert_exists or update_origin_differs happened. How about
> choosing a name that indicates retain dead tuples more explicitly for example
> retain_dead_tuples?
We considered the name you suggested, but we wanted to convey that this option
not only retains dead tuples but also preserves commit timestamps and origin
data for conflict detection, hence we opted for a more general name. Do you
have better suggestions?
>
> ---
> You mentioned in the previous email:
>
> > Furthermore, we tested running pgbench on both publisher and
> subscriber[3].
> > Some regression was observed in TPS on the subscriber, because
> > workload on the publisher is pretty high and the apply workers must
> > wait for the amount of transactions with earlier timestamps to be
> > applied and flushed before advancing the non-removable XID to remove
> > dead tuples. This is the expected behavior of this approach since the
> > patch's main goal is to retain dead tuples for reliable conflict detection.
>
> Have you conducted any performance testing of a scenario where a publisher
> replicates a large number of databases (say 64) to a subscriber? I'm particularly
> interested in a configuration where retain_conflict_info is set to true, and there
> are 64 apply workers running on the subscriber side. In such a setup, even
> when running pgbench exclusively on the publisher's databases, I suspect the
> replication lag would likely increase quickly, as all apply workers on the
> subscriber would be impacted by the overhead of retaining dead tuples.
We will try this workload and share the feedback.
>
> ---
> @@ -71,8 +72,9 @@
> #define SUBOPT_PASSWORD_REQUIRED 0x00000800
> #define SUBOPT_RUN_AS_OWNER 0x00001000
> #define SUBOPT_FAILOVER 0x00002000
> -#define SUBOPT_LSN 0x00004000
> -#define SUBOPT_ORIGIN 0x00008000
> +#define SUBOPT_RETAIN_CONFLICT_INFO 0x00004000
> +#define SUBOPT_LSN 0x00008000
> +#define SUBOPT_ORIGIN 0x00010000
>
> Why do we need to change the existing options' value?
The intention is to position the new option after the failover option, ensuring
consistency with the order in the pg_subscription catalog. I think modifying existing
options in a major version is acceptable, as we have done similarly in commits
4826759 and 776621a.
>
> ---
> + * This is required to ensure that we don't advance the xmin
> + * of CONFLICT_DETECTION_SLOT even if one of the
> subscriptions
> + * is not enabled. Otherwise, we won't be able to detect
>
> I guess the "even" in the first sentence is not necessary.
Agreed.
>
> ---
> +/*
> + * Determine the minimum non-removable transaction ID across all apply
> +workers
> + * for subscriptions that have retain_conflict_info enabled. Store the
> +result
> + * in *xmin.
> + *
> + * If the replication slot cannot be advanced during this cycle, due to
> +either
> + * a disabled subscription or an inactive worker, set *can_advance_xmin
> +to
> + * false.
> + */
> +static void
> +compute_min_nonremovable_xid(LogicalRepWorker *worker,
> + bool retain_conflict_info, TransactionId *xmin,
> + bool *can_advance_xmin)
>
> I think this function is quite confusing for several reasons. For instance, it's
> doing more things than described in the comments such as trying to create the
> CONFLICT_DETECTION_SLOT if no worker is passed. Also, one of the caller
> describes:
>
> + /*
> + * This is required to ensure that we don't advance the xmin
> + * of CONFLICT_DETECTION_SLOT even if one of the
> subscriptions
> + * is not enabled. Otherwise, we won't be able to detect
> + * conflicts reliably for such a subscription even though it
> + * has set the retain_conflict_info option.
> + */
> + compute_min_nonremovable_xid(NULL,
> sub->retainconflictinfo,
> + &xmin, &can_advance_xmin);
>
> but it's unclear to me from the function name that it tries to create the
> replication slot. Furthermore, in this path it doesn't actually compute xmin. I
> guess we can try to create CONFLICT_DETECTION_SLOT in the loop of
> "foreach(lc, sublist)" and set false to can_advance_xmin if either the
> subscription is disabled or the worker is not running.
I understand. The original code was similar to your suggestion, but we decided
to encapsulate it within a separate function to maintain a clean and concise
main loop. However, your suggestion also makes sense, so I will proceed with
the change.
>
> ---
> + FullTransactionId remote_oldestxid; /* oldest transaction ID that was in
> + * the commit phase on the
> publisher.
> + * Use FullTransactionId to prevent
> + * issues with transaction ID
> + * wraparound, where a new
> + * remote_oldestxid could falsely
> + * appear to originate from the past
> + * and block advancement */
> + FullTransactionId remote_nextxid; /* next transaction ID to be assigned
> + * on the publisher. Use
> + * FullTransactionId for consistency
> + * and to allow straightforward
> + * comparisons with
> + remote_oldestxid. */
>
> I think it would be readable if we could write them above each field.
Will adjust.
Best Regards,
Hou zj
From | Date | Subject | |
---|---|---|---|
Next Message | jian he | 2025-07-07 02:24:13 | Re: support ALTER COLUMN SET EXPRESSION over virtual generated column with check constraint |
Previous Message | wenhui qiu | 2025-07-07 02:11:57 | Re: Making pg_rewind faster |