Reorderbuffer crash during recovery

From: vignesh C <vignesh21(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Reorderbuffer crash during recovery
Date: 2019-11-06 11:50:02
Message-ID: CALDaNm2CLk+K9JDwjYST0sPbGg5AQdvhUt0jbKyX_HdAE0jk3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hi,

I found couple of crashes in reorderbuffer while review/testing of
logical_work_mem and logical streaming of large in-progress
transactions. Stack trace of the same are given below:
Issue 1:
#0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6
#1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332
#6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
record=0x3342770) at decode.c:121
#7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
#8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
<XLogSendLogical>) at walsender.c:2199
#9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
walsender.c:1128
#10 0x0000000000b6ce83 in exec_replication_command
(cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
at walsender.c:1545
#11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
postgres.c:4256
#12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
#13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
#14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
#15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
postmaster.c:1400
#16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210

Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

From initial analysis it looks like:
Issue1 it seems like if all the reorderbuffer has been flushed and
then the server restarts. This problem occurs.
Issue 2 it seems like if there are many subtransactions present and
then the server restarts. This problem occurs. The subtransaction's
final_lsn is not being set and when ReorderBufferRestoreCleanup is
called the assert fails. May be for this we might have to set the
subtransaction's final_lsn before cleanup(not sure).

I could not reproduce this issue consistently with a test case, But I
felt this looks like a problem from review.

For issue1, I could reproduce by the following steps:
1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
2) Have many open transactions with subtransactions open.
3) Attach one of the transaction from gdb and call abort().

I'm not sure of the fix for this. If I get time I will try to spend
more time to find out the fix.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Dilip Kumar 2019-11-06 12:10:52 Re: Reorderbuffer crash during recovery
Previous Message Thomas Butz 2019-11-06 11:48:16 Re: BUG #16095: Segfault while executing trigger

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2019-11-06 12:10:52 Re: Reorderbuffer crash during recovery
Previous Message Dilip Kumar 2019-11-06 11:29:37 Re: [HACKERS] Block level parallel vacuum