RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2025-08-29 06:22:53
Message-ID: TY4PR01MB1690721034ADBE0447A3FFA3C943AA@TY4PR01MB16907.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, August 28, 2025 6:07 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> >
> > I noticed that Cfbot failed to compile the document due to a typo
> > after renaming the subscription option. Here are the updated V67
> > patches to fix that, only the doc in 0001 is modified.
> >
>
> Please find a few comments:

Thanks for the comments.

>
> patch001 :
> 1)
> Assert is hit while testing patch001 alone:
> TRAP: failed Assert("TransactionIdIsValid(nonremovable_xid)"), File:
> "launcher.c", Line: 1394, PID: 55350.
>
> Scenario:
> I have 2 subs, both have stopped retention. Now on one of the sub, if I do this:
>
> --switch off rdt:
> alter subscription sub1 disable;
> alter subscription sub1 set (retain_dead_tuples=off); alter subscription sub1
> enable;
>
> --switch back rdt, launcher asserts after this alter subscription sub1 disable;
> alter subscription sub1 set (retain_dead_tuples=on); alter subscription sub1
> enable;

I fixed it by skipping collecting the xid from workers when the overall
retention has been stopped.

When fixing the reported Assert failure, I noticed another race condition that
can hit this Assert: If the launcher has obtained
pg_subscription.retentionactive and has not yet called
compute_min_nonremovable_xid(), then if the apply worker stops retention and
sets oldest_nonremovable_xid to invalid, the launcher will hit the assertion
again when calling compute_min_nonremovable_xid(). I fixed it by adding a check
in compute function:

/*
* Return if the apply worker has stopped retention concurrently.
*
* Although this function is invoked only when retentionactive is true, the
* apply worker might stop retention after the launcher fetches the
* retentionactive flag.
*/
if (!TransactionIdIsValid(nonremovable_xid))
return;

The logic in 0002 is also adjusted to avoid assigning the slot.xmin to worker in this
case.

>
> 2)
> + Maximum duration for which this subscription's apply worker
> is allowed
> + to retain the information useful for conflict detection when
> + <literal>retain_dead_tuples</literal> is enabled for the
> associated
> + subscriptions.
>
> associated subscriptions --> associated subscription (since we are talking
> about 'this subscription's apply worker')

I have reworded this part based on Amit's suggestion.

>
> 3)
> + The information useful for conflict detection is no longer
> retained if
> + all apply workers associated with the subscriptions, where
> + <literal>retain_dead_tuples</literal> is enabled, confirm that the
> + retention duration exceeded the
>
> A trivial thing: 'retention duration has exceeded' sounds better to me.

Changed

>
> ~~
>
> patch002:
> (feel free to defer these comments if we are focusing on patch001 right now):
>
> 4)
> stop_conflict_info_retention() has the correct message and detail in
> patch01 while in patch02, it is switched back to the older one (wrong one).
> Perhaps some merge mistake

Right, It a miss when rebasing the 0002. Fixed.

>
> 5)
> resume_conflict_info_retention() still refers to the old GUC name:
> max_conflict_retention_duration.

Fixed.

>
> 6)
> In compute_min_nonremovable_xid(), shall we have
> Assert(TransactionIdIsValid(nonremovable_xid)) before assigning it to
> worker->oldest_nonremovable_xid? Here:
>
> + nonremovable_xid = MyReplicationSlot->data.xmin;
> +
> + SpinLockAcquire(&worker->relmutex);
> + worker->oldest_nonremovable_xid = nonremovable_xid;
> + SpinLockRelease(&worker->relmutex);

Added.

> 7)
> Now since compute_min_nonremovable_xid() is also taking care of assigning
> valid xid to the worker, shall we update that in comment as well? We can have
> one more line:
> 'Additionally if any of the apply workers has invalid xid, assign slot's xmin to it.'

Added.

Best Regards,
Hou zj

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2025-08-29 06:23:50 Re: Mark ItemPointer parameters as const in tuple/table lock functions
Previous Message Zhijie Hou (Fujitsu) 2025-08-29 06:19:44 RE: Conflict detection for update_deleted in logical replication