Re: Conflict detection for update_deleted in logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2025-09-10 08:39:34
Message-ID: CAA4eK1LA7mnvKT9L8Nx_h+0TCvq-Ob2BGPO1bQKhkGHtoZKsow@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> > wrote:
> >
> > Few other comments:
> > 1.
> > + /*
> > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > + *
> > + * It might seem feasible to directly check the conflict detection
> > + * slot.xmin instead of relying on the launcher to assign the worker's
> > + * oldest_nonremovable_xid; however, that could lead to a race
> > + condition
> > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > + * protected by a replication slot and could become unreliable if a
> > + * wraparound occurs.
> > + */
> > + if (!TransactionIdIsValid(nonremovable_xid))
> > + return;
> >
> > I understand the reason why we get assigned the worker's non_removable_xid
> > from launcher but all this makes the code complex to understand. Isn't there
> > any other way to achieve it? One naïve and inefficient way could be to just
> > restart the worker probably after updating its retentionactive flag. I am not
> > suggesting to make this change but just a brainstorming point.
>
> I'll keep thinking about it, and for now, I've added a comment mentioning
> that rebooting is a simpler solution.
>

Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode
as soon as we encounter the first subscription with retain_dead_tuples
set during the traversal of the sublist. We release it only after
updating xmin outside the sublist traversal. We then acquire it in
EXCLUSIVE mode to fetch the resume the retention in worker for the
period till we fetch slot's xmin.

This will close the above race condition but acquiring LWLock while
traversing subscription is not advisable as that will make it
uninterruptible. The other possibility is to use some heavy-weight
lock here, say a lock on pg_subscription catalog but that has a
drawback that it can conflict with DDL operations. Yet another way is
to invent a new lock-type for this.

OTOH, the simple strategy to let apply-worker restart to resume
retention will keep the handling simpler. We do something similar at
the start of apply-worker if we find that some subscription parameter
is changed. I think we need more opinions on this matter.

One other comment:
+ if (!MySubscription->retentionactive)
+ {
+ rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
+ process_rdt_phase_transition(rdt_data, false);
+ return;
+ }

It is better that the caller processes the next phase, otherwise, this
looks a bit ad hoc. Similarly, please check other places.

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alyona Vinter 2025-09-10 09:07:35 Re: Timeline switching with partial WAL records can break replica recovery
Previous Message Chao Li 2025-09-10 08:24:13 Re: Checkpointer write combining