Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Maxim Orlov <orlovmg(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Date: 2022-11-22 12:34:07
Message-ID: CAD21AoA8v_WR6iVqVayoMvPzf5JZuHp1QLwnFdnqNq-shgfWAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Nov 22, 2022 at 6:37 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Nov 21, 2022 at 6:17 PM Maxim Orlov <orlovmg(at)gmail(dot)com> wrote:
> >
> > PROBLEM
> >
> > After some investigation, I think, the problem is in the snapbuild.c (commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
> > array in the context of builder->context, but for the time when we call SnapBuildPurgeOlderTxn this context may be already free'd.
> >
>
> I think you are seeing it freed in SnapBuildPurgeOlderTxn when we
> finish and restart decoding in the same session. After finishing the
> first decoding, it frees the decoding context but we forgot to reset
> NInitialRunningXacts and InitialRunningXacts array. So, next time when
> we start decoding in the same session where we don't restore any
> serialized snapshot, it can lead to the problem you are seeing because
> NInitialRunningXacts (and InitialRunningXacts array) won't have sane
> values.
>
> This can happen in the catalog_change_snapshot test as we have
> multiple permutations and those use the same session across a restart
> of decoding.

I have the same analysis. In order to restart the decoding from the
LSN where we don't restore any serialized snapshot, we somehow need to
advance the slot's restart_lsn. In this case, I think it happened
since the same session drops at the end of the first scenario and
creates the replication slot with the same name at the beginning of
the second scenario in catalog_change_snapshot.spec.

>
> >
> > Simple fix like:
> > @@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
> > * changes. See SnapBuildXidSetCatalogChanges.
> > */
> > NInitialRunningXacts = nxacts;
> > - InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
> > + InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz);
> > memcpy(InitialRunningXacts, running->xids, sz);
> > qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);
> >
> > seems to solve the described problem, but I'm not in the context of [0] and why array is allocated in builder->context.
> >
>
> It will leak the memory for InitialRunningXacts. We need to reset
> NInitialRunningXacts (and InitialRunningXacts) as mentioned above.
>
> Thank you for the report and initial analysis. I have added Sawada-San
> to know his views as he was the primary author of this work.

Thanks!

I've attached a draft patch. To fix it, I think we can reset
InitialRunningXacts and NInitialRunningXacts at FreeSnapshotBuilder()
and add an assertion in AllocateSnapshotBuilder() to make sure both
are reset. Regarding the tests, the patch includes a new scenario to
reproduce this issue. However, since the issue can be reproduced also
by the existing scenario (with low probability, though), I'm not sure
it's worth adding the new scenario.

I've not checked if the patch works for version 14 or older yet.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
reset_initial_running_xacts.patch application/octet-stream 6.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-11-22 12:38:28 Re: Introduce a new view for checkpointer related stats
Previous Message Amit Kapila 2022-11-22 12:22:31 Re: Logical Replication Custom Column Expression