Re: Reorderbuffer crash during recovery

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reorderbuffer crash during recovery
Date: 2019-11-09 11:37:42
Message-ID: CAA4eK1JPQu7RhMS+L1adicA4tfoWP+pgnDtR_LEMz0z33aXbzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > >
> > > > I'm a bit confused - does this happen only with the logical_work_mem
> > > > patches, or with clean master too?
> > > >
> > >
> > > This occurs with the clean master. This is a base code problem
> > > revealed while doing stress testing of logical_work_mem patches.
> >
> > As far as I can tell there are no repro steps included? Any chance to
> > get those?
> >

I think it will be bit tricky to write a test case, but the callstack
indicates the problem. For ex. for issue-1, see the below part of
callstack,

#2 0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332

The Assertion failure indicates that the changes list is empty when we
are trying to get 'lsn' of the last change. I think this is possible
when immediately after serializing the transaction changes the server
got restarted (say it crashed or somebody switched off and restarted
it). The reason is that after serializing the changes, the changes
list will be empty and serialized flag for txn will be set. I am not
if there is any reason to set the final_lsn if the changes list is
empty.

Similarly, if we see the call stack of issue-2, the problem is clear.

#2 0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973

I think this also has a symptom similar to the prior issue but for
sub-transactions. The ReorderBufferAbortOld() tries to set the
final_lsn of toplevel transaction, but not for subtransactions, later
in that path ReorderBufferRestoreCleanup expects it to be set even for
subtransaction. Is that Assert in ReorderBufferRestoreCleanup()
required, because immediately after assert, we are anyway setting the
value of final_lsn.

>
> This problem does not occur consistently. When I was reviewing and testing "logical streaming for large in-progress transactions" link [1] I found the crashes.
>
> This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres process(not from reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this reorderbuffer crash appears.
>
> To simplify the reorderbuffer crash, I used the following steps:
> 1) Make replication setup with publisher/subscriber for some table
> 2) Prepare a sql file with the below:
> begin;
> 4096 insert statements;
> select pg_sleep(120)
> 3) Execute the above script.
> 4) Attach the postgres process when pg_sleep is in progress.
> 5) call abort() from attached gdb.
>

Isn't it important to call this abort immediately after the changes
are serialized as explained above?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message James Coleman 2019-11-09 13:23:23 Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Previous Message Tomas Vondra 2019-11-09 10:01:17 Re: BUG #16105: trying to install pgadmin4

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Karlsson 2019-11-09 11:52:21 Re: Performance improvement for queries with IN clause
Previous Message Dent John 2019-11-09 10:51:58 Re: The flinfo->fn_extra question, from me this time.