Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Oh, Mike" <minsoo(at)amazon(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date: 2021-07-29 08:25:01
Message-ID: CAD21AoCxVNYK0R=SSPchcx1wm6mx=qJ5W8qDVakFCOAy1UxqAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 16, 2021 at 1:35 AM Oh, Mike <minsoo(at)amazon(dot)com> wrote:
>
> Sending this to pgsql-hackers list to create a CommitFest entry with the attached patch proposal.
>
>
>
> Hello,
>
> We noticed that the logical replication could fail when the Standby::RUNNING_XACT record is generated in the middle of a catalog modifying transaction and if the logical decoding has to restart from the RUNNING_XACT
>
> WAL entry.
>
> The Standby::RUNNING_XACT record is generated periodically (roughly every 15s by default) or during a CHECKPOINT operation.
>
>
>
> Detailed problem description:
>
> Tested on 11.8 & current master.
>
> The logical replication slot restart_lsn advances in the middle of an open txn that modified the catalog (e.g. TRUNCATE operation).
>
> Should the logical decoding has to restart it could fail with an error like this:
>
> ERROR: could not map filenode "base/13237/442428"

Thank you for reporting the issue.

I could reproduce this issue by the steps you shared.

>
> Currently, the system relies on processing Heap2::NEW_CID to keep track of catalog modifying (sub)transactions.
>
> This context is lost if the logical decoding has to restart from a Standby::RUNNING_XACTS that is written between the NEW_CID record and its parent txn commit.
>
> If the logical stream restarts from this restart_lsn, then it doesn't have the xid responsible for modifying the catalog.
>

I agree with your analysis. Since we don’t use commit WAL record to
track the transaction that has modified system catalogs, if we decode
only the commit record of such transaction, we cannot know the
transaction has been modified system catalogs, resulting in the
subsequent transaction scans system catalog with the wrong snapshot.

With the patch, if the commit WAL record has a XACT_XINFO_HAS_INVALS
flag and its xid is included in RUNNING_XACT record written at
restart_lsn, we forcibly add the top XID and its sub XIDs as a
committed transaction that has modified system catalogs to the
snapshot. I might be missing something about your patch but I have
some comments on this approach:

1. Commit WAL record may not have invalidation message for system
catalogs (e.g., when commit record has only invalidation message for
relcache) even if it has XACT_XINFO_HAS_INVALS flag. In this case, the
transaction wrongly is added to the snapshot, is that okay?

2. We might add a subtransaction XID as a committed transaction that
has modified system catalogs even if it actually didn't. As the
comment in SnapBuildBuildSnapshot() describes, we track only the
transactions that have modified the system catalog and store in the
snapshot (in the ‘xip' array). The patch could break that assumption.
However, I’m really not sure how to deal with this point. We cannot
know which subtransaction has actually modified system catalogs by
using only the commit WAL record.

3. The patch covers only the case where the restart_lsn exactly
matches the LSN of RUNNING_XACT. I wonder if there could be a case
where the decoding starts at a WAL record other than RUNNING_XACT but
the next WAL record is RUNNING_XACT.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-07-29 08:40:00 Re: ExecRTCheckPerms() and many prunable partitions
Previous Message Kyotaro Horiguchi 2021-07-29 07:59:21 Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep