Re: Assertion failure in SnapBuildInitialSnapshot()

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2025-12-09 23:24:51
Message-ID: CAD21AoDeNxDNqms0jQOiFOuvB78sg6r48XLRi+Kg7a7C9U9pOA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> > wrote:
> > >
> > > On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada
> > <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > >
> > > > Given that the computation of xmin and catalog_xmin among all slots
> > > > could be executed concurrently, could the following scenario happen
> > > > where
> > > > procArray->replication_slot_xmin and replication_slot_catalog_xmin
> > > > procArray->are retreat to a non-invalid
> > > > XID?
> > > >
> > > > 1. Suppose the initial value procArray->replication_slot_catalog_xmin is
> > 50.
> > > > 2. Process-A updates its owned slot's catalog_xmin to 100, and
> > > > computes the new catalog_xmin as 100 while holding
> > > > ReplicationSlotControlLock in a shared mode in
> > > > ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
> > procArray's catalog_xmin value yet.
> > > > 3. Process-B updates its owned slot's catalog_xmin to 150, and
> > > > computes the new catalog_xmin as 150.
> > > > 4. Process-B updates the procArray->replication_slot_catalog_xmin to
> > 150.
> > > > 5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
> > > > 100, which was 150.
> > >
> > > After further investigation, I think that steps 3 and 4 cannot occur
> > > because Process-B must have already encountered the catalog_xmin
> > > maintained by Process-A, either 50 or 100. Consequently, Process-B
> > > will refrain from updating the catalog_xmin to a more recent value, such as
> > 150.
> >
> > Right. But the following scenario seems to happen:
> >
> > 1. Both processes have a slot with effective_catalog_xmin = 100.
> > 2. Process-A updates effective_catalog_xmin to 150, and computes the new
> > catalog_xmin as 100 because process-B slot still has effective_catalog_xmin =
> > 100.
> > 3. Process-B updates effective_catalog_xmin to 150, and computes the new
> > catalog_xmin as 150.
> > 4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
> > 5. Process-A updates procArray->replication_slot_catalog_xmin to 100.
>
> I think this scenario can occur, but is not harmful. Because the catalog rows
> removed prior to xid:150 would no longer be used, as both slots have advanced
> their catalog_xmin and flushed the value to disk. Therefore, even if
> replication_slot_catalog_xmin regresses, it should be OK.
>
> Considering all above, I think allowing concurrent xmin computation, as the
> patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is that
in an extreme case, if the server crashes suddenly after removing
catalog tuples older than XID 100 and logical decoding restarts, it
ends up missing necessary catalog tuples? I think it's not a problem
as long as the subscriber knows the next commit LSN they want but
could it be problematic if the user switches to use the logical
decoding SQL API? I might be worrying too much, though.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Wong 2025-12-09 23:28:59 updates for handling optional argument in system functions
Previous Message Dmitry Koval 2025-12-09 23:22:53 Re: Add SPLIT PARTITION/MERGE PARTITIONS commands