| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Infrastructure for JIT compiling tuple deforming | 
| Date: | 2017-08-08 20:22:33 | 
| Message-ID: | 20170808202233.gkebkdka3pdicfyy@alap3.anarazel.de | 
| Lists: | pgsql-hackers | 
Hi,
As previously mentioned, tuple deforming is a major bottleneck, and
JITing it can be highly beneficial.  I previously posted a prototype
that does JITing at the slot_deform_tuple() level, caching the deforming
function in the tupledesc.
Storing things in the tupledesc isn't a great concept, however - the
lifetime of the generated function is hard to manage. More importantly,
even if we moved this into the slot, it precludes important
optimizations.
JITing the deforming is a *lot* more efficient if we can combine it with
the JITing of the expressions that use the deformed columns. There are a
couple of reasons for that:
1) By knowing the exact attnum the caller is going to request, the code
   can be optimized. No need to generate code for columns not
   deformed. If there are NOT NULL columns at/after the last
   to-be-deformed column, there's no need to generate checks about the
   length of the null-bitmap - getting rid of about half the branches!
2) By generating the deforming code in the generated expression code,
   the two are generated together. That avoids a good chunk of the
   code generation and memory mapping overhead, and it noticeably
   reduces function call overhead (because relative near calls can be
   used).
3) LLVM's optimizer can inline parts / all of the tuple deforming code
   into the expression evaluation function, further reducing
   overhead. In simpler cases and with some additional prodding, LLVM
   can even interleave deforming of individual columns and their use
   (note that I'm not proposing to do so initially).
4) If we know that the underlying tuple is an actual nonvirtual tuple,
   e.g. on the scan level, the slot deforming of NOT NULL columns can be
   replaced with direct byte accesses to the relevant column - a good
   chunk faster again.
   (note that I'm not proposing to do so initially)
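To make point 1 concrete, here is a minimal sketch in plain C of what specialization buys. It uses a toy tuple model, not PostgreSQL's real tuple layout and not actual generated code: all columns are int32, and the names ToyTuple, toy_deform_generic and toy_deform_a_b are hypothetical. The specialized variant assumes a table (a int NOT NULL, b int, c int NOT NULL) and an expression that only needs a and b.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy model, NOT PostgreSQL's real tuple layout: every column is an int32,
 * NULL columns occupy no space in data, and a set bit in the null bitmap
 * means "value present".
 */
typedef struct ToyTuple
{
    int            natts;   /* columns physically present in this tuple */
    const uint8_t *bitmap;  /* NULL if the tuple contains no NULLs */
    const uint8_t *data;    /* packed int32 values of the non-NULL columns */
} ToyTuple;

/* Generic deforming: works for any tuple, pays two checks per column. */
static void
toy_deform_generic(const ToyTuple *tup, int natts_wanted,
                   int32_t *values, bool *isnull)
{
    size_t off = 0;

    for (int i = 0; i < natts_wanted; i++)
    {
        bool present = i < tup->natts &&        /* short-tuple check */
            (tup->bitmap == NULL ||
             (tup->bitmap[i >> 3] & (1 << (i & 7))) != 0);  /* NULL check */

        isnull[i] = !present;
        values[i] = 0;
        if (present)
        {
            memcpy(&values[i], tup->data + off, sizeof(int32_t));
            off += sizeof(int32_t);
        }
    }
}

/*
 * Specialized for the hypothetical table (a NOT NULL, b, c NOT NULL) when
 * only a and b are needed. Because c is NOT NULL and sits after the last
 * wanted column, every tuple must contain at least three columns, so the
 * short-tuple checks disappear. Because a is NOT NULL, its NULL check
 * disappears too and b's offset becomes a constant.
 */
static void
toy_deform_a_b(const ToyTuple *tup, int32_t *values, bool *isnull)
{
    assert(tup->natts >= 3);    /* the assumption the specialization relies on */

    /* a: NOT NULL, no bitmap test, constant offset 0 */
    memcpy(&values[0], tup->data, sizeof(int32_t));
    isnull[0] = false;

    /* b: nullable, so a single bitmap test remains */
    isnull[1] = tup->bitmap != NULL && (tup->bitmap[0] & (1 << 1)) == 0;
    values[1] = 0;
    if (!isnull[1])
        memcpy(&values[1], tup->data + sizeof(int32_t), sizeof(int32_t));
}
```

Both functions produce the same values and null flags for any tuple of that table; the specialized one simply has no loop and only one data-dependent branch left.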
The problem however is that when generating the expression code we don't
have the necessary information. In my current prototype I'm emitting the
LLVM IR (the input to LLVM) at ExecInitExpr() time for all expressions
in a tree. That allows emitting the code for all functions in an
executor tree in one go.  Unfortunately, the current executor
initialization "framework" doesn't provide information about the
underlying slot tupledescs at that time.  Nor does it actually guarantee
that the tupledescs / slots stay the same over the course of execution.
Therefore I'd like to somehow change things so that the executor keeps
track of whether the tupledescs of the inner/outer/scan slots are going
to change, and, if not, provides them.
The right approach here seems to be to add a bit of extra data to
ExecAssignScanType etc., and move ExecInitExpr / ExecInitQual /
ExecAssignScanProjectionInfo /... to after that.  We then could keep
track of the relevant tupledescs somewhere in PlanState - that's a
bit ugly, but I don't quite see how to avoid that unless we want to add
major executor-node awareness into expression evaluation.
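As a rough illustration of the bookkeeping this would involve, here is a minimal, self-contained C sketch. All names (ToyPlanState, ToySlotInfo, toy_assign_scan_type, toy_init_expr) are hypothetical stand-ins, not existing PostgreSQL structures or functions; the only point is the ordering, with the scan type recorded before expression initialization so the latter can decide whether specialized deforming is safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tuple descriptor. */
typedef struct ToyTupleDesc
{
    int natts;
} ToyTupleDesc;

/* What the plan state would remember about one slot. */
typedef struct ToySlotInfo
{
    const ToyTupleDesc *desc;   /* descriptor, if known at init time */
    bool                fixed;  /* true if it cannot change during execution */
} ToySlotInfo;

/* Hypothetical stand-in for the per-node executor state. */
typedef struct ToyPlanState
{
    ToySlotInfo scan;
    ToySlotInfo inner;
    ToySlotInfo outer;
} ToyPlanState;

typedef enum
{
    TOY_DEFORM_GENERIC,         /* fall back to generic deforming */
    TOY_DEFORM_SPECIALIZED      /* descriptor known and stable */
} ToyDeformKind;

/* Step 1: record the scan descriptor and whether it is fixed. */
static void
toy_assign_scan_type(ToyPlanState *ps, const ToyTupleDesc *desc, bool fixed)
{
    assert(!fixed || desc != NULL); /* a fixed descriptor must be provided */
    ps->scan.desc = desc;
    ps->scan.fixed = fixed;
}

/* Returns the descriptor only if it is known and guaranteed not to change. */
static const ToyTupleDesc *
toy_known_desc(const ToySlotInfo *info)
{
    return (info->desc != NULL && info->fixed) ? info->desc : NULL;
}

/* Step 2: expression initialization, run after the descriptors are set. */
static ToyDeformKind
toy_init_expr(const ToyPlanState *ps)
{
    return toy_known_desc(&ps->scan) != NULL
        ? TOY_DEFORM_SPECIALIZED
        : TOY_DEFORM_GENERIC;
}
```

If toy_init_expr ran before toy_assign_scan_type, it would always have to choose the generic path, which mirrors the ordering problem described above.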
Thoughts? Better ideas?
Greetings,
Andres Freund