From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Subject: | RE: Conflict detection for update_deleted in logical replication |
Date: | 2025-09-12 03:25:00 |
Message-ID: | TY4PR01MB16907A4AA86C504E2DC5088BD9408A@TY4PR01MB16907.jpnprd01.prod.outlook.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Friday, September 12, 2025 2:39 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
> > > > > <houzj(dot)fnst(at)fujitsu(dot)com>
> > > > > wrote:
> > > > >
> > > > > Few other comments:
> > > > > 1.
> > > > > + /*
> > > > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > > + *
> > > > > + * It might seem feasible to directly check the conflict
> > > > > + detection
> > > > > + * slot.xmin instead of relying on the launcher to assign the
> > > > > + worker's
> > > > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > > > + condition
> > > > > + * where slot.xmin is set to InvalidTransactionId immediately
> > > > > + after the
> > > > > + * check. In such cases, oldest_nonremovable_xid would no
> > > > > + longer be
> > > > > + * protected by a replication slot and could become unreliable
> > > > > + if a
> > > > > + * wraparound occurs.
> > > > > + */
> > > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > > + return;
> > > > >
> > > > > I understand the reason why we get assigned the worker's
> > > > > non_removable_xid from launcher but all this makes the code
> > > > > complex to understand. Isn't there any other way to achieve it?
> > > > > One naïve and inefficient way could be to just restart the
> > > > > worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.
> > > >
> > > > I'll keep thinking about it, and for now, I've added a comment
> > > > mentioning that rebooting is a simpler solution.
> > > >
> > >
> > > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED
> > > mode as soon as we encounter the first subscription with
> > > retain_dead_tuples set during the traversal of the sublist. We
> > > release it only after updating xmin outside the sublist traversal.
> > > We then acquire it in EXCLUSIVE mode to fetch the resume the
> > > retention in worker for the period till we fetch slot's xmin.
> > >
> > > This will close the above race condition but acquiring LWLock while
> > > traversing subscription is not advisable as that will make it
> > > uninterruptible. The other possibility is to use some heavy-weight
> > > lock here, say a lock on pg_subscription catalog but that has a
> > > drawback that it can conflict with DDL operations. Yet another way
> > > is to invent a new lock-type for this.
> > >
> > > OTOH, the simple strategy to let apply-worker restart to resume
> > > retention will keep the handling simpler. We do something similar at
> > > the start of apply-worker if we find that some subscription
> > > parameter is changed. I think we need more opinions on this matter.
> >
> > IMHO the situation of retention being disabled and re-enabled is not a
> > common occurrence. It typically happens in specific scenarios where
> > there's a significant replication lag or the user has not configured
> > the retention timeout correctly. Because these are corner cases, I
> > believe we should avoid over-engineering a solution and simply restart
> > the worker, as Amit suggested.
>
> +1
>
> While it's ideal if workers could initialize their oldest_nonremovable_xid values
> on-the-fly, I believe we can begin with the simple solution given that stopping
> and resuming retaining of conflict info would not happen so often. In fact,
> frequent stops and restarts would typically be a sign that users might be not
> configuring the options properly for their systems. IIUC if the workers are able to
> do that, we can support to activate retain_conflict_info even for enabled
> subscriptions. I think we can leave it for future improvements if necessary.
I agree. Here is a V73 patch that will restart the worker if the retention
resumes. I also addressed other comments posted by Amit[1].
Best Regards,
Hou zj
Attachment | Content-Type | Size |
---|---|---|
v73-0001-Allow-conflict-relevant-data-retention-to-resume.patch | application/octet-stream | 17.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Naga Appani | 2025-09-12 03:32:59 | Re: [Proposal] Expose internal MultiXact member count function for efficient monitoring |
Previous Message | Dilip Kumar | 2025-09-12 03:15:36 | Re: Incorrect logic in XLogNeedsFlush() |