Re: Reorderbuffer crash during recovery

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reorderbuffer crash during recovery
Date: 2019-11-10 11:18:01
Message-ID: CALDaNm2eKs1t4k1GBnr2tjhijgSLcABksHvLV=O93ihTOO-z+w@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Sat, Nov 9, 2019 at 5:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > Hi,
> > >
> > > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > > > <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > > > >
> > > > > I'm a bit confused - does this happen only with the logical_work_mem
> > > > > patches, or with clean master too?
> > > > >
> > > >
> > > > This occurs with the clean master. This is a base code problem
> > > > revealed while doing stress testing of logical_work_mem patches.
> > >
> > > As far as I can tell there are no repro steps included? Any chance to
> > > get those?
> > >
>
> I think it will be a bit tricky to write a test case, but the call
> stack indicates the problem. For example, for issue-1, see the below
> part of the call stack:
>
> #2 0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
> fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
>
> The Assertion failure indicates that the changes list is empty when we
> are trying to get the 'lsn' of the last change. I think this is
> possible when the server got restarted (say, it crashed or somebody
> switched it off and restarted it) immediately after serializing the
> transaction's changes. The reason is that after serializing the
> changes, the changes list will be empty and the serialized flag for the
> txn will be set. I am not sure if there is any reason to set the
> final_lsn if the changes list is empty.
>
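
To make that concrete, here is a minimal sketch of the kind of guard
being suggested (not a tested patch; the struct and macro names are the
existing ones from reorderbuffer.h/ilist.h, but the surrounding code is
simplified). In ReorderBufferAbortOld(), final_lsn would only be derived
from the in-memory changes list when that list is non-empty:

/*
 * Sketch only: inside the loop in ReorderBufferAbortOld() that cleans
 * up transactions older than oldestRunningXid.  If the transaction was
 * spilled to disk, txn->changes is empty, so taking its tail element
 * trips the !dlist_is_empty() Assert shown in the backtrace above.
 */
if (txn->final_lsn == InvalidXLogRecPtr && !dlist_is_empty(&txn->changes))
{
    ReorderBufferChange *last =
        dlist_tail_element(ReorderBufferChange, node, &txn->changes);

    txn->final_lsn = last->lsn;
}
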
> Similarly, if we see the call stack of issue-2, the problem is clear.
>
> #2 0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
> fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
>
> I think this also has a symptom similar to the prior issue, but for
> subtransactions. ReorderBufferAbortOld() tries to set the final_lsn of
> the toplevel transaction, but not of the subtransactions; later in that
> path, ReorderBufferRestoreCleanup() expects it to be set even for
> subtransactions. Is that Assert in ReorderBufferRestoreCleanup() really
> required, given that immediately after the assert we anyway set the
> value of final_lsn?
>
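
For issue-2, one possible direction (again just a sketch under the same
assumptions, not a tested patch) would be to give the subtransactions a
valid final_lsn as well before ReorderBufferCleanupTXN() recurses into
them, e.g. by inheriting it from the toplevel transaction in
ReorderBufferAbortOld():

/*
 * Sketch only: once final_lsn is known for the old toplevel txn, also
 * set it for any subtransaction that still has it unset, so that the
 * Assert in ReorderBufferRestoreCleanup() holds when the cleanup
 * recurses into the subxacts (as in the backtrace above).
 */
dlist_iter  sub_iter;

dlist_foreach(sub_iter, &txn->subtxns)
{
    ReorderBufferTXN *subtxn;

    subtxn = dlist_container(ReorderBufferTXN, node, sub_iter.cur);

    if (subtxn->final_lsn == InvalidXLogRecPtr)
        subtxn->final_lsn = txn->final_lsn;
}

The alternative you mention, relaxing the Assert in
ReorderBufferRestoreCleanup() itself, would also avoid the crash.
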

Thanks, Amit, for the further analysis of the issues.

> >
> > This problem does not occur consistently. I found the crashes while reviewing and testing "logical streaming for large in-progress transactions" [1].
> >
> > This issue does not occur directly: it shows up only after some other crash occurs in a postgres process first (not in the reorderbuffer code, but due to some other issue). After that original non-reorderbuffer crash, this reorderbuffer crash appears.
> >
> > To simplify reproducing the reorderbuffer crash, I used the following steps:
> > 1) Make a replication setup with a publisher/subscriber for some table.
> > 2) Prepare an SQL file with the below:
> > begin;
> > 4096 insert statements;
> > select pg_sleep(120)
> > 3) Execute the above script.
> > 4) Attach gdb to the postgres process while pg_sleep is in progress.
> > 5) Call abort() from the attached gdb.
> >
>
> Isn't it important to call this abort immediately after the changes
> are serialized as explained above?
>

As we perform exactly 4096 insert statements on a fresh replication
setup, the transaction gets serialized immediately after the 4096th
insert. The variable max_changes_in_memory is initialized to 4096 (in
reorderbuffer.c), and txn->nentries_mem is incremented for every insert
statement in the ReorderBufferQueueChange function. Since we executed
4096 insert statements, txn->nentries_mem reaches 4096, which triggers
serialization from the ReorderBufferCheckSerializeTXN function.
Attaching gdb at that point and calling abort() therefore reproduced
the issue consistently.
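
To tie this back to the code, a simplified sketch of that spill
condition is below (the real logic lives in
ReorderBufferCheckSerializeTXN(), and the spill itself in
ReorderBufferSerializeTXN(); the wrapper name here is made up purely
for illustration):

/*
 * Sketch only: every change queued via ReorderBufferQueueChange()
 * increments txn->nentries_mem; once it reaches max_changes_in_memory
 * (4096), the transaction is spilled to disk, which empties the
 * in-memory changes list and sets the serialized flag -- exactly the
 * state in which the recovery path above misses final_lsn.
 */
static void
CheckSerializeSketch(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    if (txn->nentries_mem >= max_changes_in_memory)
        ReorderBufferSerializeTXN(rb, txn);
}

So by the time pg_sleep(120) is running, the changes have already been
spilled, and calling abort() at that point leaves a serialized
transaction behind for the recovery path to trip over.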

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
