RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Oh, Mike" <minsoo(at)amazon(dot)com>
Subject: RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date: 2021-10-08 03:07:00
Message-ID: OSBPR01MB4888FAB98EAA23CA3FEE1CE7EDB29@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, October 7, 2021 1:20 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> Regarding the patch details, I have two comments:
>
> ---
> + if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) && last_running) {
> + /* make last_running->xids bsearch()able */
> + qsort(last_running->xids,
> + last_running->subxcnt + last_running->xcnt,
> + sizeof(TransactionId), xidComparator);
>
> The patch does qsort() every time when the commit message has
> XACT_XINFO_HAS_INVALS. IIUC the xids we need to remember is the only
> xids that are recorded in the first replayed XLOG_RUNNING_XACTS, right? If so,
> we need to do qsort() once, can remove xid from the array once it gets
> committed, and then can eventually make last_running empty so that we can
> skip even TransactionIdInArray().
>
> ---
> Since last_running is allocated by malloc() and it isn't freed even after finishing
> logical decoding.
>
>
> Another idea to fix this problem would be that before calling
> SnapBuildCommitTxn() we create transaction entries in ReorderBuffer for
> (sub)transactions whose COMMIT record has XACT_XINFO_HAS_INVALS,
> and then mark all of them as catalog-changed by calling
> ReorderBufferXidSetCatalogChanges(). I've attached a PoC patch for this idea.
> What the patch does is essentially the same as what the proposed patch does.
> But the patch doesn't modify the SnapBuildCommitTxn(). And we remember
> the list of last running transactions in reorder buffer and the list is periodically
> purged during decoding RUNNING_XACTS records, eventually making it
> empty.
Thanks for the patch.

Conducted a quick check of the POC.

Test of check-world PASSED with your patch and head.
Also, the original scenario described in [1] looks fine
with your revised patch and LOG_SNAPSHOT_INTERVAL_MS expansion in the procedure.

The last command in the provided steps showed below.

postgres=# select * from pg_logical_slot_get_changes('bdt_slot', null, null);
lsn | xid | data
-----------+-----+----------------------------------------
0/1560020 | 710 | BEGIN 710
0/1560020 | 710 | table public.bdt: INSERT: a[integer]:1
0/1560140 | 710 | COMMIT 710

Minor comments for DecodeStandbyOp changes I noticed instantly
(1) minor suggestion of your comment.

+ * has done catalog changes without these records, we miss to add
+ * the xid to the snapshot so up creating the wrong snapshot. To

"miss to add" would be miss adding or fail to add.
And "up creating" is natural in this sentence ?

(2) a full-width space between "it'" and "s" in the next sentence.

+ * mark an xid that actually has not done that but it’s not a

[1] - https://www.postgresql.org/message-id/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com

Best Regards,
Takamichi Osumi

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-10-08 03:28:58 Re: Add jsonlog log_destination for JSON server logs
Previous Message Stephen Frost 2021-10-08 03:05:21 Re: storing an explicit nonce