Re: Conflict detection for update_deleted in logical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2025-09-11 18:39:14
Message-ID: CAD21AoDox9K9Vnhssj+S0ovW2bJFsNddXY1A-MWXVVfbim2NJg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> > > > wrote:
> > > >
> > > > Few other comments:
> > > > 1.
> > > > + /*
> > > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > + *
> > > > + * It might seem feasible to directly check the conflict detection
> > > > + * slot.xmin instead of relying on the launcher to assign the worker's
> > > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > > + condition
> > > > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > > > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > > > + * protected by a replication slot and could become unreliable if a
> > > > + * wraparound occurs.
> > > > + */
> > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > + return;
> > > >
> > > > I understand the reason why we get assigned the worker's non_removable_xid
> > > > from launcher but all this makes the code complex to understand. Isn't there
> > > > any other way to achieve it? One naïve and inefficient way could be to just
> > > > restart the worker probably after updating its retentionactive flag. I am not
> > > > suggesting to make this change but just a brainstorming point.
> > >
> > > I'll keep thinking about it, and for now, I've added a comment mentioning
> > > that rebooting is a simpler solution.
> > >
> >
> > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode
> > as soon as we encounter the first subscription with retain_dead_tuples
> > set during the traversal of the sublist. We release it only after
> > updating xmin outside the sublist traversal. We then acquire it in
> > EXCLUSIVE mode to fetch the resume the retention in worker for the
> > period till we fetch slot's xmin.
> >
> > This will close the above race condition but acquiring LWLock while
> > traversing subscription is not advisable as that will make it
> > uninterruptible. The other possibility is to use some heavy-weight
> > lock here, say a lock on pg_subscription catalog but that has a
> > drawback that it can conflict with DDL operations. Yet another way is
> > to invent a new lock-type for this.
> >
> > OTOH, the simple strategy to let apply-worker restart to resume
> > retention will keep the handling simpler. We do something similar at
> > the start of apply-worker if we find that some subscription parameter
> > is changed. I think we need more opinions on this matter.
>
> IMHO the situation of retention being disabled and re-enabled is not a
> common occurrence. It typically happens in specific scenarios where
> there's a significant replication lag or the user has not configured
> the retention timeout correctly. Because these are corner cases, I
> believe we should avoid over-engineering a solution and simply restart
> the worker, as Amit suggested.

+1

While it's ideal if workers could initialize their
oldest_nonremovable_xid values on-the-fly, I believe we can begin with
the simple solution given that stopping and resuming retaining of
conflict info would not happen so often. In fact, frequent stops and
restarts would typically be a sign that users might be not configuring
the options properly for their systems. IIUC if the workers are able
to do that, we can support to activate retain_conflict_info even for
enabled subscriptions. I think we can leave it for future improvements
if necessary.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2025-09-11 18:41:54 Re: plan shape work
Previous Message Tom Lane 2025-09-11 18:19:40 Re: plan shape work