| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
| Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shvetamalik(at)gmail(dot)com> |
| Subject: | Re: Proposal: Conflict log history table for Logical Replication |
| Date: | 2026-06-04 07:25:22 |
| Message-ID: | CAFiTN-v5ZJu6pLts269oyPzCGPVQKscOPyLw60UOBGB6yMmSYQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Jun 3, 2026 at 3:25 PM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
>
> On Wed, Jun 3, 2026 at 10:11 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Tue, 2 Jun 2026 at 10:51, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > PFA, latest version of patch
> > >
> > > Changes:
> > > 1. Includes Nisha's suggestion from v43_comments_nisha.patch
> > > 2. Include Amit's suggestion from v43-0001_amit.1.patch
> > > 3. Include Amit's suggestion: change GetLogDestination to
> > > GetConflictLogDest(), drop_sub_conflict_log_table() to
> > > drop_sub_conflict_log_table(), CONFLICTS_LOGGED_TO_FILE to
> > > CONFLICTS_LOGGED_TO_LOG and renamed IsConflictNamespace() to
> > > IsConflictLogTableNamespace()
> > > 4. Include Vignesh's fix Parallel_apply_failure.patch
> > > 5. Change conflict log table name back to conflict_log_table_<subid>
> > > 6. Changed acl to exclude INSERT/UPDATE/USAGE as it was there in initial patch
> > > 7. Merged 0002 to 0001 and 0004 to 0003 after review
> > >
> > > Open
> > > 1. Vignesh please rebase upgrade and \dRs+ page
> >
> > Here is the rebased version of the upgrade and \dRs patch which is
> > present in v45-0003 and v45-0004. There is no change in v45-0001 and
> > v45-0002, they are the same patch as in [1].
>
> Thanks for the updated patches, please find my comments on the v44
> patches below. These appear to apply to v45-001 and v45-002 as well,
> as both are unchanged.
>
> 1) The patches require a catversion bump. It may be worth adding it
> now so it does not get missed later during commit.
Yeah we can do this
>
> 2) As ProcessPendingConflictLogTuple() may perform a heap_insert, it
> could fail for various reasons, such as table access errors, write
> failures, WAL write failures, or ownership-related issues.
> I looked into the error handling when CLT insert fails:
>
> 2a) When disable_on_error = false
>
> /* Try to allocate a worker for the streaming transaction. */
> @@ -5666,6 +5674,7 @@ start_apply(XLogRecPtr origin_startpos)
> */
> AbortOutOfAnyTransaction();
> pgstat_report_subscription_error(MySubscription->oid);
> + ProcessPendingConflictLogTuple();
>
> PG_RE_THROW();
> }
>
> If ProcessPendingConflictLogTuple() fails, PG_RE_THROW() is never
> reached, so the original conflict error is lost and only the secondary
> CLT-related error is reported, e.g.:
>
> ERROR: could not open relation with OID 16421
> LOG: background worker "logical replication apply worker" exited
> with exit code 1
>
> If the failure persists, the apply worker will keep retrying and users
> may never see the actual conflict error. The same concern applies to
> the parallel worker case too.
>
> Is there a clean way to ensure the original apply/conflict error is
> still logged even if ProcessPendingConflictLogTuple() itself fails?
Does this happening when destination is table or with all as well?
> ~~
>
> 2b) When disable_on_error = true i.e. DisableSubscriptionAndExit() path
>
> @@ -6040,6 +6049,8 @@ DisableSubscriptionAndExit(void)
> */
> pgstat_report_subscription_error(MyLogicalRepWorker->subid);
>
> + ProcessPendingConflictLogTuple();
> +
> /* Disable the subscription */
> StartTransactionCommand();
>
> This introduces a potential failure point before the subscription is
> disabled. If ProcessPendingConflictLogTuple() fails, the subscription
> may never be disabled and the worker could repeatedly hit the same
> error.
>
> Would it be safer to move ProcessPendingConflictLogTuple() until after
> the subscription has been disabled, so that disabling the subscription
> is not dependent on successful CLT processing?
Yeah, I think it make sense, I don't see obvious issue with moving it
after disabling the subscription.
>
> 2c) When "conflict_log_destination = table" the log reports:
>
> ERROR: conflict detected on relation "public.t1": conflict=insert_exists
> DETAIL: Conflict details are logged to the conflict log table:
> pg_conflict_log_16403
>
> However, the subsequent CLT write can still fail:
> ...
> ERROR: could not open relation with OID 16426
> LOG: background worker "logical replication apply worker" exited
> with exit code 1
>
> Since the CLT write has not yet occurred when this message is emitted,
> would it make sense to change it to:
> "Conflict details will be logged to the conflict log table: ..."
>
> to avoid implying that the logging has already succeeded?
IMHO "Conflict details are logged to the conflict log table" doesn't
mean a conflict has been logged to the table; what I interpret from
this message is that conflict logging to the table is enabled.
> ~~~
>
> 3) The v45-002 commit message still refers to the old conflict log
> table name, pg_conflict.pg_conflict_log_for_subid_16396. It should be
> updated to pg_conflict_log_16396.
> ~~~
>
> 4) 035_conflicts.pl: I see that v45-002 has corrected the CLT name in
> below test -
> +# Verify the contents of the Conflict Log Table (CLT)
> +# This section ensures that the CLT contains the expected
> +# type and specific key data.
> +my $subid = $node_subscriber->safe_psql('postgres',
> + "SELECT oid FROM pg_subscription WHERE subname = 'sub_tab';");
> +my $clt = "pg_conflict.pg_conflict_log_$subid";
> +
> +# Wait for the conflict to be logged in the CLT
> +my $log_check = $node_subscriber->poll_query_until(
> + 'postgres',
> + "SELECT count(*) > 0 FROM $clt;"
> +);
>
> I wonder why this wasn't caught in the v44 tests. It seems the check
> below is silently broken because poll_query_until() keeps retrying
> until timeout when the table does not exist.
> Should we replace it with an ok() test instead? something like -
>
> # Wait for the conflict to be logged in the CLT
> ok($node_subscriber->poll_query_until(
> 'postgres',
> "SELECT count(*) > 0 FROM $clt;"),
> 'conflict appeared in CLT after insert');
Why would CLT not exist, IIUC this is positive test where we are
trying to validate the conflict data from CLT, no?
--
Regards,
Dilip Kumar
Google
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ashutosh Sharma | 2026-06-04 07:36:09 | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Previous Message | Jingtang Zhang | 2026-06-04 07:19:24 | Re: Optimize CPU usage of dropping buffers during recovery |