RE: test_decoding assertion failure for the loss of top-sub transaction relationship

From: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "dilipbalaut(at)gmail(dot)com" <dilipbalaut(at)gmail(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: test_decoding assertion failure for the loss of top-sub transaction relationship
Date: 2022-09-06 07:47:21
Message-ID: TYAPR01MB58660803BCAA7849C8584AA4F57E9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear hackers,

> I agreed both that DEBUG2 messages are still useful but we should not
> change the condition for output. So I prefer the idea suggested by Amit.

PSA newer patch, which contains the fix and test.

> > I have not verified but I think we need to backpatch this till 14
> > because prior to that in DecodeCommit, we use to set the top-level txn
> > as having catalog changes based on the if there are invalidation
> > messages in the commit record. So, in the current scenario shared by
> > Osumi-San, before SnapBuildCommitTxn(), the top-level txn will be
> > marked as having catalog changes.
>
> I and Osumi-san are now investigating that, so please wait further reports and
> patches.

We investigated it about older versions, and in some versions *another stack-trace* has been found.

About PG10-13, indeed, the failure was not occurred.
In these versions transactions are regarded as
that have catalog changes when the commit record has XACT_XINFO_HAS_INVALS flag.
This flag will be set if the transaction has invalidation messages.

When sub transaction changes system catalogs and user commits,
all invalidation messages allocated in sub transaction will be transferred to top transaction.
Therefore both transactions will be marked as containing catalog changes.

About PG14 and 15, however, another stack-trace has been found.
While executing the same workload, we got followings at the same SQL statement;

```
(gdb) backtrace
#0 0x00007fa78c6dc387 in raise () from /lib64/libc.so.6
#1 0x00007fa78c6dda78 in abort () from /lib64/libc.so.6
#2 0x0000000000b16680 in ExceptionalCondition (conditionName=0xcd3ab0 "txn->ninvalidations == 0", errorType=0xcd3284 "FailedAssertion",
fileName=0xcd32d0 "reorderbuffer.c", lineNumber=2936) at assert.c:69
#3 0x00000000008e9e70 in ReorderBufferForget (rb=0x12b5b10, xid=735, lsn=24125384) at reorderbuffer.c:2936
#4 0x00000000008d9493 in DecodeCommit (ctx=0x12a2d20, buf=0x7ffe08b236b0, parsed=0x7ffe08b23510, xid=734, two_phase=false) at decode.c:733
#5 0x00000000008d8962 in DecodeXactOp (ctx=0x12a2d20, buf=0x7ffe08b236b0) at decode.c:279
#6 0x00000000008d85e2 in LogicalDecodingProcessRecord (ctx=0x12a2d20, record=0x12a30e0) at decode.c:142
#7 0x00000000008dfef2 in pg_logical_slot_get_changes_guts (fcinfo=0x129acb0, confirm=true, binary=false) at logicalfuncs.c:296
#8 0x00000000008e002f in pg_logical_slot_get_changes (fcinfo=0x129acb0) at logicalfuncs.c:365
...
(gdb) frame 4
#4 0x00000000008d9493 in DecodeCommit (ctx=0x14cfd20, buf=0x7ffc638b0ca0, parsed=0x7ffc638b0b00, xid=734, two_phase=false) at decode.c:733
733 ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
(gdb) list
728 */
729 if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
730 {
731 for (i = 0; i < parsed->nsubxacts; i++)
732 {
733 ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
734 }
735 ReorderBufferForget(ctx->reorder, xid, buf->origptr);
736
737 return;
(gdb) frame 3
#3 0x00000000008e9e70 in ReorderBufferForget (rb=0x14e2b10, xid=735, lsn=24125152) at reorderbuffer.c:2936
2936 Assert(txn->ninvalidations == 0);
(gdb) list
2931 */
2932 if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
2933 ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
2934 txn->invalidations);
2935 else
2936 Assert(txn->ninvalidations == 0);
2937
2938 /* remove potential on-disk data, and deallocate */
2939 ReorderBufferCleanupTXN(rb, txn);
2940 }
(gdb) print *txn
$1 = {txn_flags = 3, xid = 735, toplevel_xid = 734, gid = 0x0, first_lsn = 24113488, final_lsn = 24125152, end_lsn = 0, toptxn = 0x14ecb98,
restart_decoding_lsn = 24113304, origin_id = 0, origin_lsn = 0, commit_time = 0, base_snapshot = 0x0, base_snapshot_lsn = 0,
base_snapshot_node = {prev = 0x14ecc00, next = 0x14e2b28}, snapshot_now = 0x0, command_id = 4294967295, nentries = 5, nentries_mem = 5,
changes = {head = {prev = 0x14eecf8, next = 0x14eeb18}}, tuplecids = {head = {prev = 0x14ecb10, next = 0x14ecb10}}, ntuplecids = 0,
tuplecid_hash = 0x0, toast_hash = 0x0, subtxns = {head = {prev = 0x14ecb38, next = 0x14ecb38}}, nsubtxns = 0, ninvalidations = 3,
invalidations = 0x14e2d28, node = {prev = 0x14ecc68, next = 0x14ecc68}, size = 452, total_size = 452, concurrent_abort = false,
output_plugin_private = 0x0}
```

In these versions DecodeCommit() said OK. However, we have met another failure
because the ReorderBufferTXN of the sub transaction had invalidation messages but it did not have base_snapshot.

I thought that this failure was occurred the only the base_snapshot of the sub transaction is transferred via ReorderBufferTransferSnapToParent()
when a transaction is assigned as child, but its invalidation messages are not.

I was not sure what's the proper way to fix it.
The solution I've thought at first was transporting all invalidations from sub to top like ReorderBufferTransferSnapToParent(),
but I do not know its side effect. Moreover, how do we deal with ReorderBufferChange?
Should we transfer them too? If so, how about the ordering of changes?
Alternative solustion was just remove the assertion, but was it OK?

How do you think? I want to hear your comments and suggestions...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
HEAD-0001-Fix-assertion-failure-during-logical-decoding.patch application/octet-stream 5.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhang Mingli 2022-09-06 07:50:08 Re: Remove dead macro exec_subplan_get_plan
Previous Message Antonin Houska 2022-09-06 07:26:08 Return value of PathNameOpenFile()