Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Oh, Mike" <minsoo(at)amazon(dot)com>
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date: 2022-08-29 06:17:28
Message-ID: CAA4eK1LXfDKjPCqhe0Sw_OeaVr31WecHGmt+TtdxCZeMuCFQzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Aug 27, 2022 at 7:06 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Sat, Aug 27, 2022 at 7:24 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sat, Aug 27, 2022 at 1:06 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > > >
> > > > > I think then we should change this code in the master branch patch
> > > > > with an additional comment on the lines of: "Either all the xacts got
> > > > > purged or none. It is only possible to partially remove the xids from
> > > > > this array if one or more of the xids are still running but not all.
> > > > > That can happen if we start decoding from a point (LSN where the
> > > > > snapshot state became consistent) where all the xacts in this were
> > > > > running and then at least one of those got committed and a few are
> > > > > still running. We will never start from such a point because we won't
> > > > > move the slot's restart_lsn past the point where the oldest running
> > > > > transaction's restart_decoding_lsn is."
> > > > >
> > > >
> > > > Unfortunately, this theory doesn't turn out to be true. While
> > > > investigating the latest buildfarm failure [1], I see that it is
> > > > possible that only part of the xacts in the restored catalog modifying
> > > > xacts list needs to be purged. See the attached where I have
> > > > demonstrated it via a reproducible test. It seems the point we were
> > > > missing was that to start from a point where two or more catalog
> > > > modifying were serialized, it requires another open transaction before
> > > > both get committed, and then we need the checkpoint or other way to
> > > > force running_xacts record in-between the commit of initial two
> > > > catalog modifying xacts. There could possibly be other ways as well
> > > > but the theory above wasn't correct.
> > > >
> > >
> > > Thank you for the analysis and the patch. I have the same conclusion.
> > > Since we took this approach only on the master the back branches are
> > > not affected.
> > >
> > > The new test scenario makes sense to me and looks better than the one
> > > I have. Regarding the fix, I think we should use
> > > TransactionIdFollowsOrEquals() instead of
> > > NormalTransactionIdPrecedes():
> > >
> > > + for (off = 0; off < builder->catchange.xcnt; off++)
> > > + {
> > > + if (NormalTransactionIdPrecedes(builder->catchange.xip[off],
> > > + builder->xmin))
> > > + break;
> > > + }
> > >
> >
> > Right, fixed.
>
> Thank you for updating the patch! It looks good to me.
>

Pushed.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2022-08-29 06:29:27 Re: Handle infinite recursion in logical replication setup
Previous Message Ajin Cherian 2022-08-29 06:14:57 Re: Support logical replication of DDLs