Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date: 2022-10-13 07:08:40
Message-ID: CAA4eK1Lmx6MygvCOL3Xncn4CQ3gEDhuAiMK0cBkMyvMORrgwrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> fails as reported because the current logical decoding cannot properly
> handle the case where the decoding restarts from NEW_CID. Since we
> don't make the association between top-level transaction and its
> subtransaction while decoding NEW_CID (ie, in
> SnapBuildProcessNewCid()), two transactions are created in
> ReorderBuffer as top-txn and have the same LSN. This failure happens
> on all supported versions.
>
> To fix the problem, one idea is that we make the association between
> top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> as Tomas proposed. On the other hand, since we don't guarantee to make
> the association between the top-level transaction and its
> sub-transactions until we try to decode the actual contents of the
> transaction, it makes sense to me that instead of trying to solve by
> making association, we need to change the code which are assuming that
> it is associated.
>
> I've attached the patch for this idea. With the patch, we skip the
> assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> which we start decoding the contents of transaction, ie.
> start_decoding_at in SnapBuild. The minor concern is other way that
> the assertion check could miss some faulty cases where two unrelated
> top-transactions could have same LSN. With this patch, it will pass
> for such a case. Therefore, for transactions that we skipped checking,
> we do the check when we reach the LSN.
>

>
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
buf.origptr);
}

+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
rmgr = GetRmgr(XLogRecGetRmid(record));
>

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.

Now, say we have commit records for multiple transactions which are
after start_decoding_at but all their changes are before
start_decoding_at, then we won't check their ordering at commit time
but OTOH, we would have checked their ordering before restart. Isn't
that sufficient?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message osumi.takamichi@fujitsu.com 2022-10-13 07:16:39 possible typo for CREATE PUBLICATION description
Previous Message Michael Paquier 2022-10-13 07:06:20 Re: Add mssing test to test_decoding/meson.build