Quick Links

RE: Conflict detection for update_deleted in logical replication

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	RE: Conflict detection for update_deleted in logical replication
Date:	2025-09-16 04:38:49
Message-ID:	TY4PR01MB16907805DE4816E53C54708159414A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tuesday, September 16, 2025 11:54 AM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Monday, September 15, 2025 8:11 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > Thanks, the changes look good to me. I have merged them in V75 patch.
> > >
> >
> > Pushed. I see a BF which is not related with this commit but a
> > previous commit for the work in this thread.
> >
> > See LOGs [1]:
> > regress_log_035_conflicts
> > -----------------------------------
> > [11:16:47.604](0.015s) not ok 24 - the deleted column is removed
> > [11:16:47.605](0.002s) # Failed test 'the deleted column is removed'
> > # at
> > /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_confl
> > ict
> > s.pl
> > line 562.
> >
> > Then the corresponding subscriber LOG:
> >
> > 025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> > vacuuming "postgres.public.tab"
> > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> > finished vacuuming "postgres.public.tab": index scans: 0
> > pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly
> > scanned
> > tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples
> > missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > removable
> > cutoff: 787, which was 0 XIDs old when operation ended ...
> >
> > This indicates that the Vacuum is not able to remove the row even
> > after the slot is advanced because some other background process holds
> > a lock/pin on the page. I think that is possible because the page was
> > dirty due to apply of update operation and bgwriter/checkpointer could try to
> write such a page.
> >
> > I'll analyze more tomorrow and share if I have any new findings.
>
> I agree with the analysis. I attempted to delete a tuple from a table and, while
> executing VACUUM(verbose) on this table, I executed a checkpoint
> concurrently.
> Using the debugger, I stoped in SyncOneBuffer() after acquiring the page
> block.
> This allowed me to reproduce the same log where the deleted row could not be
> removed:
>
> --
> tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples missed: 1
> dead from 1 pages not removed due to cleanup lock contention
> --
>
> I think we can remove the VACUUM for removing the deleted column. We have
> already confirmed that the replication slot.xmin has advanced, which should
> be sufficient to prove that the feature works correctly.

Apart from above fix, I noticed another BF failure[1].

--
timed out waiting for match: (?^:logical replication worker for subscription "tap_sub_a_b" will resume retaining the information for detecting conflicts
--

It is clear from the log[2] that the apply worker resumes retention immediately
after the synchronized_standby_slots configuration is removed, but before the
max_retention_duration is set to 0. We expected resumption to occur only after
adjusting max_retention_duration to 0, thus overlooking the log. To ensure
stability, we should postpone the removal of synchronized_standby_slots until
setting max_retention_duration to 0.

I can reproduce it locally by adding "sleep 10;" after resetting the
synchronized_standby_slots GUC and before resume test

I updated the patch to fix this as well.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2025-09-15%2009%3A07%3A45&stg=subscription-check

[2]
2025-09-15 11:53:41.450 UTC [3604:23] LOG: logical replication worker for subscription "tap_sub_a_b" will resume retaining the information for detecting conflicts
2025-09-15 11:53:41.450 UTC [3604:24] DETAIL: Retention is re-enabled as the apply process is advancing its xmin within the configured max_retention_duration of 1 ms.

Best Regards,
Hou zj

Attachment	Content-Type	Size
v2-0001-Stablize-the-tests-in-035_conflicts.patch	application/octet-stream	4.1 KB

In response to

RE: Conflict detection for update_deleted in logical replication at 2025-09-16 03:53:38 from Zhijie Hou (Fujitsu)

Responses

Re: Conflict detection for update_deleted in logical replication at 2025-09-16 05:42:59 from shveta malik

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Shubham Khanna	2025-09-16 04:52:43	Re: Add support for specifying tables in pg_createsubscriber.
Previous Message	Hayato Kuroda (Fujitsu)	2025-09-16 04:37:15	RE: [Patch] add new parameter to pg_replication_origin_session_setup