Re: BUG #16971: Incompatible datalayout errors with llvmjit

From: Tom Stellard <tstellar(at)redhat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: BUG #16971: Incompatible datalayout errors with llvmjit
Date: 2021-04-20 21:42:28
Message-ID: 2c613fbc-7196-42de-d861-e22b96edf231@redhat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 4/20/21 12:29 PM, Andres Freund wrote:
> Hi,
>
> Thanks to tgl for pointing me this thread...
>
> On 2021-04-19 18:29:52 +0000, PG Bug reporting form wrote:
>> In our Fedora builds, we are getting errors[1] in the postgresql tests due
>> to incompatible datalayouts between the JIT engine and the LLVM modules
>> being compiled. The problem is that the JIT engine is being created with
>> host specific CPU and features, while the datalayout for the compiled module
>> is being taken from llvmjit_types.bc which is compiled without any specified
>> CPU type or features.
>
> It's very odd that features would change the data layout - analogizing
> with plain C code that'd mean that you cannot link a binary compiled
> with something like -mavx2 against a library compiled without. To me
> this smells like a bug somewhere lower level.
>

You are correct that is odd, and to be honest, I didn't think that LLVM
targets were allowed to change the datalayout based on the CPU type.
I'm checking with upstream LLVM to see if this allowed or not. However,
this behavior is present in at least LLVM 11 and LLVM 12 (I haven't
checked earlier versions), so postgresql will have to deal with this
somehow.

> Reformatting the error yields:
> ERROR: failed to JIT module: Added modules have incompatible data layouts:
> E-m:e-i1:8:16-i8:8:16-i64:64-f128:64- a:8:16-n32:64 (module) vs
> E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)
>
> The -v128:64 is about how to align vectors. Skimming the relevant LLVM
> code I don't see why it'd be included in JIted code but not native code.
>

The relevant code in LLVM is here:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp#L88

> Just to be sure, I take
> checking for llvm-config... /usr/bin/llvm-config
> checking for clang... /usr/bin/clang
> are for llvm & clang compiled from the same code?
>

Yes, they are same.

> I temporarily had access to s390x to debug an unrelated issue in the
> past, and there I did hit this problem, but IIRC only because of clang
> vs llvm version mismatches.
>
> Unfortunately it is a bit hard to debug without access to a s390x
> box... Can you provide that? If not I might ask the Debian folks for
> access to one of their porter machines.
>

You can demonstrate this issue on an x86_64 machine using clang:

# Equivalent to how llvmjit_types.bc is compiled:
[root(at)8bbe06c6258e ~]# echo "void foo(){}" | clang --target=s390x-- -x c -S -emit-llvm - -o - | grep datalayout
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64"

# Equivalent to how JIT'd code is compiled:
[root(at)8bbe06c6258e ~]# echo "void foo(){}" | clang --target=s390x-- -march=z13 -x c -S -emit-llvm - -o - | grep datalayout
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64"

> FWIW, the Debian s390x build seems to succeed at the moment:
> https://buildd.debian.org/status/fetch.php?pkg=postgresql-13&arch=s390x&ver=13.2-1&stamp=1613044202&raw=0
>

The Fedora builders are z13 systems. If Debian is using zEC12, then this issue
would not happen, since zEC12 does not use the vector ABI:

echo "void foo(){}" | clang --target=s390x-- -march=zEC12 -x c -S -emit-llvm - -o - | grep datalayout
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64"

However, if someone were to install Debian postgresql on a z13 system, they
would likely run into this bug.

>
>> One way to fix this would be to add -march=native to the %.bc rules in
>> src/Makefile.global.in. However, this will only work when the build system
>> and the run system are the same.
>
>> I think to fix this correctly, the llvmjit_types.bc file will need to be
>> compiled when the JIT engine is initialized at runtime, so that it can use
>> the same datalayout as the JIT engine.
>
> That'd require headers to be present, which I don't think we should
> require... And more importantly, it'd not go very far, because we also
> have lot of other .bc files that will have the layout embedded (for
> inlining functions/operators into JITed code). Which'd then require all
> the source code to be present and to be compiled into bitcode. Not an,
> uh, satisfying option.
>

OK, if that's not an option, then there needs to be a way to match the CPU
type used to compile llvmjit_types.bc and the CPU types used for the JIT
(at least for s390x anyway). I looked at a few other targets, and it
doesn't look like they have this same problem (but I could be wrong).

For Fedora, I may try to hard-code the JIT CPU type to zEC12 on s390x builds,
to fix this. I don't know if this would work as a general solution.

>
>
>> [1] https://kojipkgs.fedoraproject.org//work/tasks/2182/66082182/build.log
>
> Random thing I noticed while scrolling through the log:
>> configure: WARNING: unrecognized options: --disable-dependency-tracking
>
> PG requires dependency tracking to be explicitly enabled, and it's a
> different flag name (--enable-depend).
>

OK, thanks. I will pass this along to the postgresql maintainers.

-Tom

> Greetings,
>
> Andres Freund
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2021-04-20 22:52:28 Re: BUG #16971: Incompatible datalayout errors with llvmjit
Previous Message Andres Freund 2021-04-20 19:29:37 Re: BUG #16971: Incompatible datalayout errors with llvmjit