JIT compiling with LLVM v9.0

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: JIT compiling with LLVM v9.0
Date: 2018-01-24 07:20:38
Message-ID: 20180124072038.jviav7h3fgkv7hto@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

I've spent the last weeks working on my LLVM compilation patchset. In
the course of that I *heavily* revised it. While still a good bit away
from committable, it's IMO definitely not a prototype anymore.

There's too many small changes, so I'm only going to list the major
things. A good bit of that is new. The actual LLVM IR emission itself
hasn't changed that drastically. Since I've not described them in
detail before I'll describe from scratch in a few cases, even if things
haven't fully changed.

== JIT Interface ==

To avoid emitting code in very small increments (increases mmap/mremap
rw vs exec remapping, compile/optimization time), code generation
doesn't happen for every single expression individually, but in batches.

The basic object to emit code via is a jit context created with:
extern LLVMJitContext *llvm_create_context(bool optimize);
which in the case of expressions is stored on-demand in the EState. For
other use cases that might not be the right location.

To emit LLVM IR (ie. the portable code that LLVM then optimizes and
generates native code for), one gets a module from that with:
extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);

to which "arbitrary" numbers of functions can be added. In case of
expression evaluation, we get the module once for every expression, and
emit one function for the expression itself, and one for every
applicable/referenced deform function.

As explained above, we do not want to emit code immediately from within
ExecInitExpr()/ExecReadyExpr(). To facilitate that, readying a JITed
expression sets the function to a callback, which gets the actual native
function on the first actual call. That allows batching together the
generation of all native functions that are defined before the first
expression is evaluated - in a lot of queries that'll be all.
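
As a rough illustration of the deferred scheme just described, here is a toy model in plain C with no LLVM involved. All names (DemoExpr, demo_ready_expr, and so on) are hypothetical and not taken from the patchset; "compiling" is simulated by swapping a function pointer. It is only meant to show how a first-call callback lets all functions readied so far be generated in a single batch:

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the patchset. */
typedef struct DemoExpr DemoExpr;
typedef int (*DemoEvalFunc) (DemoExpr *expr);

struct DemoExpr
{
    DemoEvalFunc evalfunc;      /* what the executor would call */
    int         value;          /* toy "expression": just returns this */
    DemoExpr   *next;           /* expressions still awaiting code generation */
};

static DemoExpr *demo_pending = NULL;   /* readied, but not yet "compiled" */
static int  demo_batches = 0;           /* number of batch "compilations" */

/* Stand-in for the native function the JIT would produce. */
static int
demo_native(DemoExpr *expr)
{
    return expr->value;
}

/* "Compile" every pending expression at once and patch its pointer. */
static void
demo_compile_pending(void)
{
    if (demo_pending == NULL)
        return;

    demo_batches++;
    while (demo_pending != NULL)
    {
        DemoExpr   *cur = demo_pending;

        demo_pending = cur->next;
        cur->next = NULL;
        cur->evalfunc = demo_native;
    }
}

/* Callback installed at "ready" time; the first call triggers the batch. */
static int
demo_first_call(DemoExpr *expr)
{
    demo_compile_pending();
    return expr->evalfunc(expr);
}

/* Analogue of readying an expression: queue it, defer code generation. */
static void
demo_ready_expr(DemoExpr *expr, int value)
{
    expr->value = value;
    expr->evalfunc = demo_first_call;
    expr->next = demo_pending;
    demo_pending = expr;
}
```

Readying two expressions and then evaluating the first one runs exactly one batch; the second expression already points at its "native" function by the time it is first called.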

Said callback then calls
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
which'll emit code for the "in progress" mutable module if necessary,
and then searches all generated functions for the name. The names are
created via
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
currently "evalexpr" and "deform" with a generation and counter suffix.
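
A minimal sketch of what such name generation could look like. The exact name format is not given above, so the "basename_generation_counter" layout, the DemoNameState struct and demo_expand_funcname() are all assumptions made purely for illustration:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical bookkeeping; the real context struct is not shown above. */
typedef struct DemoNameState
{
    size_t      generation;     /* assumed: bumped per emitted module */
    size_t      counter;        /* assumed: functions named so far */
} DemoNameState;

/*
 * Write "<basename>_<generation>_<counter>" into buf and advance the
 * counter, so that every generated function gets a unique name.
 * Returns what snprintf() returns.
 */
static int
demo_expand_funcname(DemoNameState *state, const char *basename,
                     char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%s_%zu_%zu",
                    basename, state->generation, state->counter++);
}
```

With a zero-initialized state, successive calls for "evalexpr" and "deform" would yield "evalexpr_0_0" and "deform_0_1" under this assumed format.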

Currently expressions which do not have access to an EState, basically
all "parent"-less expressions, aren't JIT compiled. That could be
changed, but I so far do not see a huge need.

== Error handling ==

There's two aspects to error handling.

Firstly, generated (LLVM IR) and emitted functions (mmap()ed segments)
need to be cleaned up both after a successful query execution and after
an error. I've settled on a fairly boring resowner based mechanism. On
errors all expressions owned by a resowner are released, upon success
expressions are reassigned to the parent / released on commit (unless
executor shutdown has cleaned them up of course).

A second, less pretty and newly developed, aspect of error handling is
OOM handling inside LLVM itself. The above resowner based mechanism
takes care of cleaning up emitted code upon ERROR, but there's also the
chance that LLVM itself runs out of memory. LLVM by default does *not*
use any C++ exceptions. Its allocations are primarily funneled through
the standard "new" handlers, and some direct use of malloc() and
mmap(). For the former a 'new handler' exists
(http://en.cppreference.com/w/cpp/memory/new/set_new_handler); for the
latter LLVM provides callbacks that get called upon failure
(unfortunately mmap() failures are treated as fatal rather than OOM
errors).
What I've chosen to do, and I'd be interested to get some input about
that, is to have two functions that LLVM using code must use:
extern void llvm_enter_fatal_on_oom(void);
extern void llvm_leave_fatal_on_oom(void);
Before interacting with LLVM code (ie. emitting IR, or using the above
functions) llvm_enter_fatal_on_oom() needs to be called.

When a libstdc++ new or LLVM error occurs, the handlers set up by the
above functions trigger a FATAL error. We have to use FATAL rather than
ERROR, as we *cannot* reliably throw ERROR inside a foreign library
without risking corrupting its internal state.

Users of the above sections do *not* have to use PG_TRY/CATCH blocks,
the handlers instead are reset on toplevel sigsetjmp() level.

Using a relatively small enter/leave protected section of code, rather
than setting up these handlers globally, avoids negative interactions
with extensions that might use C++ like e.g. postgis. As LLVM code
generation should never execute arbitrary code, just setting these
handlers temporarily ought to suffice.
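
To make the enter/leave idea concrete, here is a toy model in plain C. It does not touch LLVM or the C++ new handler: a function pointer stands in for the installed handler, and a counter stands in for raising FATAL. The nesting depth counter and the reset function are assumptions about how such a section might be tracked, and every demo_* name is hypothetical:

```c
#include <stddef.h>

typedef void (*demo_oom_handler) (void);

/* Stands in for the process-wide C++ new handler / LLVM error handler. */
static demo_oom_handler demo_current_handler = NULL;
static demo_oom_handler demo_saved_handler = NULL;
static int  demo_fatal_depth = 0;   /* assumed: sections may nest */
static int  demo_fatal_raised = 0;  /* counts simulated FATAL errors */

/* In a real backend this would raise FATAL instead of counting. */
static void
demo_fatal_on_oom(void)
{
    demo_fatal_raised++;
}

/* Install the FATAL-raising handler on outermost entry. */
static void
demo_enter_fatal_on_oom(void)
{
    if (demo_fatal_depth == 0)
    {
        demo_saved_handler = demo_current_handler;
        demo_current_handler = demo_fatal_on_oom;
    }
    demo_fatal_depth++;
}

/* Restore the previous handler when the outermost section is left. */
static void
demo_leave_fatal_on_oom(void)
{
    demo_fatal_depth--;
    if (demo_fatal_depth == 0)
        demo_current_handler = demo_saved_handler;
}

/* What top-level error recovery would call: drop all open sections. */
static void
demo_reset_fatal_on_oom(void)
{
    if (demo_fatal_depth > 0)
    {
        demo_fatal_depth = 0;
        demo_current_handler = demo_saved_handler;
    }
}
```

Because the handler is only installed between enter and leave, C++ code in other extensions keeps whatever handler it had outside of that window, which is the point of keeping the protected section small.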

== LLVM Interface / patches ==

Unfortunately a bit of required LLVM functionality, particularly around
error handling but also initialization, isn't currently fully exposed
via LLVM's C-API. A bit more *optional* API isn't exposed either.

Instead of requiring a brand-new version of LLVM that has exposed this
functionality I decided it's better to have a small C++ wrapper that can
provide this functionality. Due to that new wrapper significantly older
LLVM versions can now be used (for now I've only runtime tested 5.0 and
master, 4.0 would be possible with a few ifdefs, a bit older probably
doable as well). Given that LLVM is written in C++ itself, an optional
dependency on a C++ compiler for one file doesn't seem to be too bad.

== Inlining ==

One big advantage of JITing expressions is that it can significantly
reduce the overhead of postgres' extensible function/operator mechanism,
by inlining the body of called operators.

This is the part of code that I've worked on most significantly. While I
think JITing is an entirely viable project without committed inlining, I
felt that we definitely need to know how exactly we want to do inlining
before merging other parts. 3 different implementations later, I'm
fairly confident that I have a good concept, even though a few corners
still need to be smoothed.

As a quick background, LLVM works on the basis of a high-level
"abstract" assembly representation (llvm.org/docs/LangRef.html). This
can be generated in memory, stored in binary form (bitcode files ending
in .bc) or text representation (.ll files). The clang compiler always
generates the in-memory representation and the -emit-llvm flag tells it
to write that out to disk, rather than .o files/binaries.

This facility allows us to get the bitcode for all operators
(e.g. int8eq, float8pl etc), without maintaining two copies. The way
I've currently set it up is that, if --with-llvm is passed to configure,
all backend files are also compiled to bitcode files. These bitcode
files get installed into the server's
$pkglibdir/bitcode/postgres/
under their original subfolder, eg.
~/build/postgres/dev-assert/install/lib/bitcode/postgres/utils/adt/float.bc
Using existing LLVM functionality (for parallel LTO compilation),
additionally an index over these is stored in
$pkglibdir/bitcode/postgres.index.bc

When deciding to JIT for the first time, $pkglibdir/bitcode/ is scanned
for all .index.bc files and a *combined* index over all these files is
built in memory. The reason for doing so is that it gives extensions
"easy" access to inlining - they can install code into
$pkglibdir/bitcode/[extension]/
accompanied by
$pkglibdir/bitcode/[extension].index.bc
just alongside the actual library.
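
The directory scan could look roughly like the following POSIX sketch. It only finds files whose names end in ".index.bc" and prints them; actually loading and merging the indexes is left out, and the demo_* function names are hypothetical:

```c
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if name ends in ".index.bc" and has a non-empty stem before it. */
static bool
demo_is_index_file(const char *name)
{
    static const char suffix[] = ".index.bc";
    size_t      namelen = strlen(name);
    size_t      suffixlen = sizeof(suffix) - 1;

    return namelen > suffixlen &&
        strcmp(name + namelen - suffixlen, suffix) == 0;
}

/*
 * Scan dirpath (e.g. $pkglibdir/bitcode) for index files. Returns the
 * number of index files found, or -1 if the directory can't be opened.
 */
static int
demo_scan_bitcode_dir(const char *dirpath)
{
    DIR        *dir = opendir(dirpath);
    struct dirent *de;
    int         nfound = 0;

    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL)
    {
        if (demo_is_index_file(de->d_name))
        {
            /* a real implementation would load and merge the index here */
            printf("would load index: %s/%s\n", dirpath, de->d_name);
            nfound++;
        }
    }
    closedir(dir);

    return nfound;
}
```

An extension's [extension].index.bc is picked up by the same loop as postgres.index.bc, without the core code having to know about the extension.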

The inlining implementation (I had to write my own, as LLVM's isn't
suitable for a number of reasons) can then use the combined in-memory
index to look up all 'extern' function references, judge their size, and
then open just the file containing the implementation (ie. the above
float.bc). Currently there's a limit of 150 instructions for functions
to be inlined, functions used by inlined functions have a budget of 0.5
* limit, and so on. This gets rid of most operators in the queries I
tested, although there's a few that resist inlining due to references to
file-local static variables - but those largely don't seem to be
performance relevant.
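
The budget rule can be written down as a small sketch. The limit of 150 and the factor of 0.5 per level come from the description above; treating the limit as inclusive, and the function names, are assumptions:

```c
#include <stdbool.h>

#define DEMO_INLINE_LIMIT 150   /* instruction limit for directly called functions */

/*
 * Budget for a function reached at the given call depth: depth 0 is a
 * function referenced directly by the expression, depth 1 a function
 * used by an inlined function, and so on. Each level halves the budget.
 */
static double
demo_inline_budget(int depth)
{
    double      budget = DEMO_INLINE_LIMIT;

    for (int i = 0; i < depth; i++)
        budget *= 0.5;

    return budget;
}

/* Assumed: a function exactly at the budget is still inlined. */
static bool
demo_should_inline(int ninstructions, int depth)
{
    return ninstructions <= demo_inline_budget(depth);
}
```

So a 100-instruction operator would be inlined when called directly (budget 150), but not when it is only reached through another inlined function (budget 75).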

== Type Synchronization ==

For my current two main avenues of performance optimizations due to
JITing, expression eval and tuple deforming, it's obviously required
that code generation knows about at least a few postgres types (tuple
slots, heap tuples, expr context/state, etc).

Initially I'd provided the types to LLVM by emitting them manually, like:
    {
        LLVMTypeRef members[15];

        members[ 0] = LLVMInt32Type(); /* type */
        members[ 1] = LLVMInt8Type(); /* isempty */
        members[ 2] = LLVMInt8Type(); /* shouldFree */
        members[ 3] = LLVMInt8Type(); /* shouldFreeMin */
        members[ 4] = LLVMInt8Type(); /* slow */
        members[ 5] = LLVMPointerType(StructHeapTupleData, 0); /* tuple */
        members[ 6] = LLVMPointerType(StructtupleDesc, 0); /* tupleDescriptor */
        members[ 7] = TypeMemoryContext; /* mcxt */
        members[ 8] = LLVMInt32Type(); /* buffer */
        members[ 9] = LLVMInt32Type(); /* nvalid */
        members[10] = LLVMPointerType(TypeSizeT, 0); /* values */
        members[11] = LLVMPointerType(LLVMInt8Type(), 0); /* nulls */
        members[12] = LLVMPointerType(StructMinimalTupleData, 0); /* mintuple */
        members[13] = StructHeapTupleData; /* minhdr */
        members[14] = LLVMInt64Type(); /* off */

        StructTupleTableSlot = LLVMStructCreateNamed(LLVMGetGlobalContext(),
                                                     "struct.TupleTableSlot");
        LLVMStructSetBody(StructTupleTableSlot, members, lengthof(members), false);
    }
and then using numeric offsets when emitting code, like:
    LLVMBuildStructGEP(builder, v_slot, 9, "")
to compute the address of the nvalid field of a slot at runtime.

But that obviously duplicates a lot of information and is incredibly
error prone. Doesn't seem acceptable.

What I've now instead done is have one small file (llvmjit_types.c)
which references each of the types required for JITing. That file is
translated to bitcode at compile time, and loaded when LLVM is
initialized in a backend. That works very well to synchronize the type
definition, unfortunately it does *not* synchronize offsets as the IR
level representation doesn't know field names.

Instead I've added defines to the original struct definition that
provide access to the relevant offsets. Eg.
#define FIELDNO_TUPLETABLESLOT_NVALID 9
int tts_nvalid; /* # of valid values in tts_values */
While that still needs to be defined, it's only required for a
relatively small number of fields, and it's bunched together with the
struct definition, so it's easily kept synchronized.
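
A self-contained toy version of the idea, without LLVM: the field-number defines sit next to the fields they describe, and a lookup by field number (standing in for what LLVMBuildStructGEP() does with the type loaded from bitcode) resolves them to addresses. DemoSlot, its defines and demo_struct_gep() are hypothetical, not the real TupleTableSlot:

```c
#include <stddef.h>

/* Toy struct; the FIELDNO_* defines sit right next to their fields. */
typedef struct DemoSlot
{
#define FIELDNO_DEMOSLOT_TYPE 0
    int         type;
    char        isempty;
#define FIELDNO_DEMOSLOT_NVALID 2
    int         nvalid;         /* # of valid values */
} DemoSlot;

/*
 * Byte offset of each field, indexed by field number. In the scheme
 * described above LLVM derives this from the struct type found in the
 * bitcode; here it is spelled out so the example runs without LLVM.
 */
static const size_t demo_slot_offsets[] = {
    offsetof(DemoSlot, type),
    offsetof(DemoSlot, isempty),
    offsetof(DemoSlot, nvalid),
};

/* Toy analogue of a struct GEP: address of field number fieldno. */
static void *
demo_struct_gep(void *base, int fieldno)
{
    return (char *) base + demo_slot_offsets[fieldno];
}
```

Code generation then refers to FIELDNO_DEMOSLOT_NVALID rather than a bare 2, so adding a field ahead of nvalid only requires touching the define that sits right above it.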

A significant downside for this is that clang needs to be around to
create that bitcode file, but that doesn't seem that bad as an optional
*build*-time, *not* runtime, dependency.

Not a perfect solution, but I don't quite see a better approach.

== Minimal cost based planning & config ==

Currently there's a number of GUCs that influence JITing:

- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
get JITed, *without* optimization (expensive part), corresponding to
-O0. This commonly already results in significant speedups if
expression/deforming is a bottleneck (removing dynamic branches
mostly).
- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
get JITed, *with* optimization (expensive part).
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if query has
higher cost.

For all of these -1 is a hard disable.
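
The way these three thresholds combine can be sketched as follows. The comparison against the query's total cost and the -1 disable follow the description above; that optimization and inlining are only considered once JITing itself is enabled is an assumption, as are all the names:

```c
#include <stdbool.h>

/* Hypothetical result struct; not part of the patchset. */
typedef struct DemoJitDecision
{
    bool        jit;            /* JIT at all (unoptimized, like -O0) */
    bool        optimize;       /* run the expensive optimization passes */
    bool        inlining;       /* try to inline operators/functions */
} DemoJitDecision;

/* A threshold of -1 disables the feature, otherwise compare costs. */
static bool
demo_above(double total_cost, double threshold)
{
    return threshold >= 0 && total_cost > threshold;
}

static DemoJitDecision
demo_decide_jit(double total_cost,
                double jit_above_cost,
                double jit_optimize_above_cost,
                double jit_inline_above_cost)
{
    DemoJitDecision d;

    d.jit = demo_above(total_cost, jit_above_cost);
    /* assumed: the other two only matter if we JIT in the first place */
    d.optimize = d.jit && demo_above(total_cost, jit_optimize_above_cost);
    d.inlining = d.jit && demo_above(total_cost, jit_inline_above_cost);

    return d;
}
```

With made-up thresholds of 100, 500 and 500, a query costing 200 would be JITed without optimization or inlining, while one costing 1000 would get all three.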

There currently also exist:
- jit_expressions = 0/1
- jit_deform = 0/1
- jit_perform_inlining = 0/1
but I think they could just be removed in favor of the above.

Additionally there's a few debugging/other GUCs:

- jit_debugging_support = 0/1 - register generated functions with the
debugger. Unfortunately GDB's JIT integration scales O(#functions^2),
albeit with a very small constant, so it cannot always be enabled :(
- jit_profiling_support = 0/1 - emit information so perf gets notified
about JITed functions. As this logs data to disk that is not
automatically cleaned up (otherwise it'd be useless), this definitely
cannot be enabled by default.
- jit_dump_bitcode = 0/1 - log generated pre/post optimization bitcode
to disk. This is quite useful for development, so I'd want to keep it.
- jit_log_ir = 0/1 - dump generated IR to the logfile. I found this to
be too verbose, and I think it should be yanked.

Do people feel these should be hidden behind #ifdefs, always present but
prevented from being set to a meaningful value, or unrestricted?

== Remaining work ==

These I'm planning to tackle in the near future; they need to be tackled
before merging.

- Add a big readme
- Add docs
- Add / check LLVM 4.0 support
- reconsider location of JITing code (lib/ and heaptuple.c specifically)
- Split llvmjit_wrap.cpp into three files (error handling, inlining,
temporary LLVM C API extensions)
- Split the bigger commit, improve commit messages
- Significant amounts of local code cleanup and comments
  - duplicated code in expression emission for very related step types
  - more consistent LLVM variable naming
  - pgindent
- timing information about JITing needs to be fewer messages, and hidden
behind a GUC.
- improve logging (mostly remove)

== Future Todo (some already in-progress) ==

- JITed hash computation for nodeAgg & nodeHash. That's currently a
major bottleneck.
- Increase quality of generated code. There's a *lot* left still on the
table. The generated code currently spills far too much into memory,
and LLVM only can optimize that away to a limited degree. I've
experimented some and for TPCH Q01 it's possible to get at least
another x1.8 due to that, with expression eval *still* being the
bottleneck afterwards...
- Caching of the generated code, drastically reducing overhead and
allowing JITing to be beneficial in OLTP cases. Currently the biggest
obstacle to that is the number of specific memory locations referenced
in the expression representation, but that definitely can be improved
(a lot of it by the above point alone).
- More elaborate planning model
- The cloning of modules could be reduced to only cloning required
parts. As that's the most expensive part of inlining and most of the
time only a few functions are used, this should probably be done soon.

== Code ==

As the patchset is large (500kb) and I'm still quickly evolving it, I do
not yet want to attach it. The git tree is at
https://git.postgresql.org/git/users/andresfreund/postgres.git
in the jit branch
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/jit

To build, --with-llvm has to be passed to configure; llvm-config either
needs to be in PATH or provided with LLVM_CONFIG to make. A C++ compiler
and clang need to be available under common names or provided via CXX /
CLANG respectively.

Regards,

Andres Freund
