Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Oh, Mike" <minsoo(at)amazon(dot)com>
Subject: Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Date: 2022-07-28 11:57:26
Message-ID: CAA4eK1Jt4v37D4XBOu=h+tu+wzijpjh1Bw5h-_eeLEkRbxrSEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 28, 2022 at 3:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Jul 26, 2022 at 1:22 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Okay, I've attached an updated patch that does the above idea. Could
> > you please do the performance tests again to see if the idea can help
> > reduce the overhead, Shi yu?
> >
>
> While reviewing the patch for HEAD, I have changed a few comments. See
> attached, if you agree with these changes then include them in the
> next version.
>

I have another comment on this patch:
SnapBuildPurgeOlderTxn()
{
...
+ if (surviving_xids > 0)
+ memmove(builder->catchange.xip, &(builder->catchange.xip[off]),
+ surviving_xids * sizeof(TransactionId))
...

For this code to hit, we must have a situation where one or more of
the xacts in this array must be still running. And, if that is true,
we would not have started from the restart point where the
corresponding snapshot (that contains the still running xacts) has
been serialized because we advance the restart point to not before the
oldest running xacts restart_decoding_lsn. This may not be easy to
understand so let me take an example to explain. Say we have two
transactions t1 and t2, and both have made catalog changes. We want a
situation where one of those gets purged and the other remains in
builder->catchange.xip array. I have tried variants of the below
sequence to see if I can get into the required situation but am not
able to make it.

Session-1
Checkpoint -1;
T1
DDL

Session-2
T2
DDL

Session-3
Checkpoint-2;
pg_logical_slot_get_changes()
-- Here when we serialize the snapshot corresponding to
CHECKPOINT-2's running_xact record, we will serialize both t1 and t2
as catalog-changing xacts.

Session-1
T1
Commit;

Checkpoint;
pg_logical_slot_get_changes()
-- Here we will restore from Checkpoint-1's serialized snapshot and
won't be able to move restart_point to Checkpoint-2 because T2 is
still open.

Now, as per my understanding, it is only possible to move
restart_point to Checkpoint-2 if T2 gets committed/rolled-back in
which case we will never have that in surviving_xids array after the
purge.

It is possible I am missing something here. Do let me know your thoughts.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2022-07-28 12:47:10 Re: Hash index build performance tweak from sorting
Previous Message Andrey Borodin 2022-07-28 11:46:01 Re: [Commitfest 2022-07] Patch Triage: Waiting on Author