Re: Don't clean up LLVM state when exiting in a bad way

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Jelte Fennema <Jelte(dot)Fennema(at)microsoft(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Don't clean up LLVM state when exiting in a bad way
Date: 2021-09-14 05:05:23
Message-ID: 0B6E6A15-4611-451B-936C-E8846CC8E847@anarazel.de
Lists: pgsql-hackers

Hi,

On September 13, 2021 9:00:00 PM PDT, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>Hello hackers,
>14.09.2021 04:32, Andres Freund wrote:
>> On 2021-09-07 14:44:39 -0500, Justin Pryzby wrote:
>>> On Tue, Sep 07, 2021 at 12:27:27PM -0700, Andres Freund wrote:
>>>> I think this is a tad too strong. We should continue to clean up on exit as
>>>> long as the error didn't happen while we're already inside llvm
>>>> code. Otherwise we lose some ability to find leaks. How about checking in the
>>>> error path whether fatal_new_handler_depth is > 0, and skipping cleanup in
>>>> that case? Because that's precisely when it should be unsafe to reenter
>>>> LLVM.
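
For illustration, a minimal C sketch of the guard suggested above, assuming an on_proc_exit callback shaped like llvmjit.c's llvm_shutdown() and the fatal_new_handler_depth counter it names; the committed fix may differ in detail:

    /*
     * Incremented while control is inside LLVM (e.g. around the C++
     * new-handler); nonzero means the error was raised from within LLVM.
     */
    static int fatal_new_handler_depth = 0;

    /* registered via on_proc_exit() */
    static void
    llvm_shutdown(int code, Datum arg)
    {
        /*
         * If the exit was triggered by an error raised while we were
         * inside LLVM, re-entering LLVM to tear down its state is
         * unsafe: skip the cleanup entirely.
         */
        if (fatal_new_handler_depth > 0)
            return;

        /*
         * ... otherwise clean up normally, so leaks stay detectable and
         * profiling data still gets written out ...
         */
    }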
>> The more important reason is actually profiling information that needs to be
>> written out.
>>
>> I've now pushed a fix to all relevant branches. Thanks all!
>>
>I've encountered a similar issue last week, but found this discussion only
>after the commit.
>I'm afraid that it's not completely gone yet. I've reproduced a similar
>crash (on edb4d95d) with
>echo "statement_timeout = 50
>jit_optimize_above_cost = 1
>jit_inline_above_cost = 1
>parallel_setup_cost=0
>parallel_tuple_cost=0
>" >/tmp/extra.config
>TEMP_CONFIG=/tmp/extra.config  make check
>
>parallel group (11 tests):  memoize explain hash_part partition_info reloptions tuplesort compression partition_aggregate indexing partition_prune partition_join
>     partition_join               ... FAILED (test process exited with exit code 2)     1815 ms
>     partition_prune              ... FAILED (test process exited with exit code 2)     1779 ms
>     reloptions                   ... ok          146 ms
>
>I've extracted the crash-causing fragment from the partition_prune test
>to reproduce the segfault reliably (see the patch attached).
>The segfault stack is:
>Core was generated by `postgres: parallel worker for PID 12029'.
>Program terminated with signal 11, Segmentation fault.
>#0  0x00007f045e0a88ca in notifyFreed (K=<optimized out>, Obj=..., this=<optimized out>)
>    at /usr/src/debug/llvm-7.0.1.src/lib/ExecutionEngine/Orc/OrcCBindingsStack.h:485
>485           Listener->NotifyFreeingObject(Obj);
>(gdb) bt
>#0  0x00007f045e0a88ca in notifyFreed (K=<optimized out>, Obj=..., this=<optimized out>)
>    at /usr/src/debug/llvm-7.0.1.src/lib/ExecutionEngine/Orc/OrcCBindingsStack.h:485
>#1  operator() (K=<optimized out>, Obj=..., __closure=<optimized out>)
>    at /usr/src/debug/llvm-7.0.1.src/lib/ExecutionEngine/Orc/OrcCBindingsStack.h:226
>#2  std::_Function_handler<void (unsigned long, llvm::object::ObjectFile const&), llvm::OrcCBindingsStack::OrcCBindingsStack(llvm::TargetMachine&, std::function<std::unique_ptr<llvm::orc::IndirectStubsManager, std::default_delete<llvm::orc::IndirectStubsManager> > ()>)::{lambda(unsigned long, llvm::object::ObjectFile const&)#3}>::_M_invoke(std::_Any_data const&, unsigned long, llvm::object::ObjectFile const&) (__functor=..., __args#0=<optimized out>, __args#1=...)
>    at /usr/include/c++/4.8.2/functional:2071
>#3  0x00007f045e0aa578 in operator() (__args#1=..., __args#0=<optimized out>, this=<optimized out>)
>    at /usr/include/c++/4.8.2/functional:2471
>...
>
>The corresponding code in OrcCBindingsStack.h is:
>void notifyFreed(orc::VModuleKey K, const object::ObjectFile &Obj) {
>    for (auto &Listener : EventListeners)
>     Listener->NotifyFreeingObject(Obj);
>}
>So probably one of the EventListeners has become null. I see that
>without debugging and profiling enabled the only listener registration
>in the postgres code is LLVMOrcRegisterJITEventListener.
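
For context, a rough sketch of how that registration looks with the LLVM-C API. The listener-creation functions are as declared in llvm-c/ExecutionEngine.h and llvm-c/OrcBindings.h; the wrapper function here is hypothetical, and the exact guards (jit_debugging_support / jit_profiling_support, plus configure-time HAVE_DECL checks) around the real call sites in llvmjit.c are from memory and may differ:

    #include <llvm-c/ExecutionEngine.h>
    #include <llvm-c/OrcBindings.h>

    /* hypothetical helper; orc is an already-created ORC stack */
    static void
    register_jit_listeners(LLVMOrcJITStackRef orc)
    {
        /* lets debuggers such as gdb resolve JITed frames */
        if (jit_debugging_support)
            LLVMOrcRegisterJITEventListener(orc,
                                            LLVMCreateGDBRegistrationListener());

        /* emits data letting perf attribute samples to JITed code */
        if (jit_profiling_support)
            LLVMOrcRegisterJITEventListener(orc,
                                            LLVMCreatePerfJITEventListener());
    }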
>
>With LLVM 9 on the same CentOS 7 I don't get such a segfault. It also
>doesn't happen on different OSes with LLVM 7.

That looks like an llvm bug to me, rather than the usage issue addressed in this thread.

>I still have no
>explanation for that, but maybe there is a difference between LLVM
>configure options, e.g. like this:
>https://stackoverflow.com/questions/47712670/segmentation-fault-in-llvm-pass-when-using-registerstandardpasses

Why is it not much more likely that bugs were fixed?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
