Quick Links

RE: Conflict detection for update_deleted in logical replication

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	RE: Conflict detection for update_deleted in logical replication
Date:	2025-09-12 03:25:00
Message-ID:	TY4PR01MB16907A4AA86C504E2DC5088BD9408A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Friday, September 12, 2025 2:39 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
> > > > > <houzj(dot)fnst(at)fujitsu(dot)com>
> > > > > wrote:
> > > > >
> > > > > Few other comments:
> > > > > 1.
> > > > > + /*
> > > > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > > + *
> > > > > + * It might seem feasible to directly check the conflict
> > > > > + detection
> > > > > + * slot.xmin instead of relying on the launcher to assign the
> > > > > + worker's
> > > > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > > > + condition
> > > > > + * where slot.xmin is set to InvalidTransactionId immediately
> > > > > + after the
> > > > > + * check. In such cases, oldest_nonremovable_xid would no
> > > > > + longer be
> > > > > + * protected by a replication slot and could become unreliable
> > > > > + if a
> > > > > + * wraparound occurs.
> > > > > + */
> > > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > > + return;
> > > > >
> > > > > I understand the reason why we get assigned the worker's
> > > > > non_removable_xid from launcher but all this makes the code
> > > > > complex to understand. Isn't there any other way to achieve it?
> > > > > One naïve and inefficient way could be to just restart the
> > > > > worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.
> > > >
> > > > I'll keep thinking about it, and for now, I've added a comment
> > > > mentioning that rebooting is a simpler solution.
> > > >
> > >
> > > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED
> > > mode as soon as we encounter the first subscription with
> > > retain_dead_tuples set during the traversal of the sublist. We
> > > release it only after updating xmin outside the sublist traversal.
> > > We then acquire it in EXCLUSIVE mode to fetch the resume the
> > > retention in worker for the period till we fetch slot's xmin.
> > >
> > > This will close the above race condition but acquiring LWLock while
> > > traversing subscription is not advisable as that will make it
> > > uninterruptible. The other possibility is to use some heavy-weight
> > > lock here, say a lock on pg_subscription catalog but that has a
> > > drawback that it can conflict with DDL operations. Yet another way
> > > is to invent a new lock-type for this.
> > >
> > > OTOH, the simple strategy to let apply-worker restart to resume
> > > retention will keep the handling simpler. We do something similar at
> > > the start of apply-worker if we find that some subscription
> > > parameter is changed. I think we need more opinions on this matter.
> >
> > IMHO the situation of retention being disabled and re-enabled is not a
> > common occurrence. It typically happens in specific scenarios where
> > there's a significant replication lag or the user has not configured
> > the retention timeout correctly. Because these are corner cases, I
> > believe we should avoid over-engineering a solution and simply restart
> > the worker, as Amit suggested.
>
> +1
>
> While it's ideal if workers could initialize their oldest_nonremovable_xid values
> on-the-fly, I believe we can begin with the simple solution given that stopping
> and resuming retaining of conflict info would not happen so often. In fact,
> frequent stops and restarts would typically be a sign that users might be not
> configuring the options properly for their systems. IIUC if the workers are able to
> do that, we can support to activate retain_conflict_info even for enabled
> subscriptions. I think we can leave it for future improvements if necessary.

I agree. Here is a V73 patch that will restart the worker if the retention
resumes. I also addressed other comments posted by Amit[1].

[1] https://www.postgresql.org/message-id/CAA4eK1LA7mnvKT9L8Nx_h%2B0TCvq-Ob2BGPO1bQKhkGHtoZKsow%40mail.gmail.com

Best Regards,
Hou zj

Attachment	Content-Type	Size
v73-0001-Allow-conflict-relevant-data-retention-to-resume.patch	application/octet-stream	17.9 KB

In response to

Re: Conflict detection for update_deleted in logical replication at 2025-09-11 18:39:14 from Masahiko Sawada

Responses

Re: Conflict detection for update_deleted in logical replication at 2025-09-12 08:42:31 from Amit Kapila
Re: Conflict detection for update_deleted in logical replication at 2025-09-12 08:47:31 from shveta malik

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Naga Appani	2025-09-12 03:32:59	Re: [Proposal] Expose internal MultiXact member count function for efficient monitoring
Previous Message	Dilip Kumar	2025-09-12 03:15:36	Re: Incorrect logic in XLogNeedsFlush()