Quick Links

RE: Assertion failure in SnapBuildInitialSnapshot()

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	RE: Assertion failure in SnapBuildInitialSnapshot()
Date:	2025-12-19 03:19:15
Message-ID:	TY4PR01MB16907F9A1B31C9542D3AC616894A9A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
> > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada
> > > <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > Right. But the following scenario seems to happen:
> > > > >
> > > > > 1. Both processes have a slot with effective_catalog_xmin = 100.
> > > > > 2. Process-A updates effective_catalog_xmin to 150, and computes
> > > > > the
> > > new
> > > > > catalog_xmin as 100 because process-B slot still has
> > > effective_catalog_xmin =
> > > > > 100.
> > > > > 3. Process-B updates effective_catalog_xmin to 150, and computes
> > > > > the
> > > new
> > > > > catalog_xmin as 150.
> > > > > 4. Process-B updates procArray->replication_slot_catalog_xmin to
> 150.
> > > > > 5. Process-A updates procArray->replication_slot_catalog_xmin to
> 100.
> > > >
> > > > I think this scenario can occur, but is not harmful. Because the
> > > > catalog rows removed prior to xid:150 would no longer be used, as
> > > > both slots have
> > > advanced
> > > > their catalog_xmin and flushed the value to disk. Therefore, even
> > > > if replication_slot_catalog_xmin regresses, it should be OK.
> > > >
> > > > Considering all above, I think allowing concurrent xmin
> > > > computation, as the patch does, is acceptable. What do you think ?
> > >
> > > I agree with your analysis. Another thing I'd like to confirm is
> > > that in an extreme case, if the server crashes suddenly after
> > > removing catalog tuples older than XID 100 and logical decoding
> > > restarts, it ends up missing necessary catalog tuples? I think it's
> > > not a problem as long as the subscriber knows the next commit LSN
> > > they want but could it be problematic if the user switches to use
> > > the logical decoding SQL API? I might be worrying too much, though.
> >
> > I think this case is not a problem because:
> >
> > In LogicalConfirmReceivedLocation, the updated restart_lsn and
> > catalog_xmin are flushed to disk before the effective_catalog_xmin is
> > updated. Thus, once replication_slot_catalog_xmin advances to a
> > certain value, even in the event of a crash, users won't encounter any
> > removed tuples when consuming from slots after a restart. This is
> > because all slots have their updated restart_lsn flushed to disk,
> > ensuring that upon restarting, changes are decoded from the updated
> position where older catalog tuples are no longer needed.
>
> Agreed.
>
> >
> > BTW, I assume you meant catalog tuples older than XID 150 are removed,
> > since in the previous example, tuples older than XID 100 are already not
> useful.
>
> Right. Thank you for pointing this out.
>
> I think we can proceed with the idea proposed by Hou-san. Regarding the
> patch, while it mostly looks good, it might be worth considering adding more
> comments:
>
> - If the caller passes already_locked=true to
> ReplicationSlotsComputeRequiredXmin(), the lock order should also be
> considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
> - ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
> process, resulting in temporarily moving
> procArray->replication_slot_catalog_xmin backward, but it's harmless
> because a smaller catalog_xmin is conservative: it merely prevents VACUUM
> from removing catalog tuples that could otherwise be pruned. It does not lead
> to premature deletion of required data.

Thanks for the comments. I added some more comments as suggested.

Here is the updated patch.

Best Regards,
Hou zj

Attachment	Content-Type	Size
v5HEAD-0001-Fix-a-race-condition-of-updating-procArray-re.patch	application/octet-stream	8.5 KB

In response to

Re: Assertion failure in SnapBuildInitialSnapshot() at 2025-12-18 19:42:18 from Masahiko Sawada

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Marcos Magueta	2025-12-19 03:25:51	Re: WIP - xmlvalidate implementation from TODO list
Previous Message	Fujii Masao	2025-12-19 03:17:01	Re: pgsql: Add function to log the memory contexts of specified backend pro