Re: Reorderbuffer crash during recovery

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reorderbuffer crash during recovery
Date: 2019-12-11 05:43:04
Message-ID: CALDaNm3ArR_4a+_6KZWUocvFcwCsrO-y-kOSgOY76c6U7HUHkA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

> I found couple of crashes in reorderbuffer while review/testing of
> logical_work_mem and logical streaming of large in-progress
> transactions. Stack trace of the same are given below:
> Issue 1:
> #0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> #1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> #2 0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
> fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
> #6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> record=0x3342770) at decode.c:121
> #7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> #8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> <XLogSendLogical>) at walsender.c:2199
> #9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> walsender.c:1128
> #10 0x0000000000b6ce83 in exec_replication_command
> (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> (proto_version '1', publication_names '\"pub1\"')")
> at walsender.c:1545
> #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> postgres.c:4256
> #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> postmaster.c:1400
> #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
>
> Issue 2:
> #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2 0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
> fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
> #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> From initial analysis it looks like:
> Issue1 it seems like if all the reorderbuffer has been flushed and
> then the server restarts. This problem occurs.
> Issue 2 it seems like if there are many subtransactions present and
> then the server restarts. This problem occurs. The subtransaction's
> final_lsn is not being set and when ReorderBufferRestoreCleanup is
> called the assert fails. May be for this we might have to set the
> subtransaction's final_lsn before cleanup(not sure).
>
> I could not reproduce this issue consistently with a test case, But I
> felt this looks like a problem from review.
>
> For issue1, I could reproduce by the following steps:
> 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> 2) Have many open transactions with subtransactions open.
> 3) Attach one of the transaction from gdb and call abort().
>
> I'm not sure of the fix for this. If I get time I will try to spend
> more time to find out the fix.

I have further analyzed the issue and found that:
After abort is called, when the process restarts, it will clean the
reorder information for the aborted transactions in
ReorderBufferAbortOld function. It crashes in the below code as there
are no changes present currently and all the changes are serialized:
.......
if (txn->serialized && txn->final_lsn == 0))
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);

txn->final_lsn = last->lsn;
}
.......

It sets the final_lsn here so that it can iterate from the start_lsn
to final_lsn and cleanup the serialized files in
ReorderBufferRestoreCleanup function. One solution We were thinking
was to store the lsn of the last serialized change while serialiizing
and set the final_lsn in the above case where it crashes like the
below code:
......
if (txn->serialized && txn->final_lsn == 0 &&
!dlist_is_empty(&txn->changes))
{
ReorderBufferChange *last =
dlist_tail_element(ReorderBufferChange, node, &txn->changes);

txn->final_lsn = last->lsn;
}
else
{
/*
* If there are no changes present as all of the changes were
* serialized, use the last lsn that was serialized.
*/
txn->final_lsn = txn->current_serialized_lsn;
}
......

I have tested the same scenario and verified it to be working. The
patch for the same is attached. I could not add a test case for this
as it involves attaching to gdb and calling abort.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0001-Reorder-buffer-crash-while-aborting-old-transactions.patch text/x-patch 2.8 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Drouvot, Bertrand 2019-12-11 08:17:01 Re: logical decoding bug: segfault in ReorderBufferToastReplace()
Previous Message Michael Paquier 2019-12-11 05:27:45 Re: logical replication: could not create file "state.tmp": File exists

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-12-11 05:48:24 Re: Windows buildfarm members vs. new async-notify isolation test
Previous Message Kyotaro Horiguchi 2019-12-11 05:37:37 Re: Start Walreceiver completely before shut down it on standby server.