Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Oh, Mike" <minsoo(at)amazon(dot)com>
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date: 2022-06-13 02:58:57
Message-ID: CAD21AoBmJny+KkaLLtwbz+i6cThUQ=eGFmjeWZb0rdvbmPXN3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 7, 2022 at 9:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, May 30, 2022 at 11:13 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, May 25, 2022 at 12:11 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> >
> > poc_add_regression_tests.patch adds regression tests for this bug. The
> > regression tests are required for both HEAD and back-patching but I've
> > separated this patch for testing the above two patches easily.
> >

Thank you for the comments.

>
> Few comments on the test case patch:
> ===============================
> 1.
> +# For the transaction that TRUNCATEd the table tbl1, the last decoding decodes
> +# only its COMMIT record, because it starts from the RUNNING_XACT
> record emitted
> +# during the first checkpoint execution. This transaction must be marked as
> +# catalog-changes while decoding the COMMIT record and the decoding
> of the INSERT
> +# record must read the pg_class with the correct historic snapshot.
> +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate"
> "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert"
> "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
>
> Will this test always work? What if we get an additional running_xact
> record between steps "s0_commit" and "s0_begin" that is logged via
> bgwriter? You can mimic that by adding an additional checkpoint
> between those two steps. If we do that, the test will pass even
> without the patch because I think the last decoding will start
> decoding from this new running_xact record.

Right. It could pass depending on the timing but doesn't fail
depending on the timing. I think we need to somehow stop bgwriter to
make the test case stable but it seems unrealistic. Do you have any
better ideas?

>
> 2.
> +step "s1_get_changes" { SELECT data FROM
> pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
> 'include-xids', '0'); }
>
> It is better to skip empty transactions by using 'skip-empty-xacts' to
> avoid any transaction from a background process like autovacuum. We
> have previously seen some buildfarm failures due to that.

Agreed.

>
> 3. Did you intentionally omit the .out from the test case patch?

No, I'll add .out file in the next version patch.

>
> 4.
> This transaction must be marked as
> +# catalog-changes while decoding the COMMIT record and the decoding
> of the INSERT
> +# record must read the pg_class with the correct historic snapshot.
>
> /marked as catalog-changes/marked as containing catalog changes

Agreed.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-06-13 03:05:51 "buffer too small" or "path too long"?
Previous Message Julien Rouhaud 2022-06-13 02:32:13 Re: Add header support to text format and matching feature