Re: Lazy JIT IR code generation to increase JIT speed with partitions

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Luc Vlaming <luc(at)swarm64(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Lazy JIT IR code generation to increase JIT speed with partitions
Date: 2022-06-27 14:55:55
Message-ID: CAPsAnrkVBMSGb6uiYF_Y2XdmMtz-TwdqcYfet+8vR_ETDfDGNw@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

I picked this up and had a closer look at how the total JIT time depends
on the number of modules to be jitted.

Indeed, the total JIT time increases with the number of modules used. The
reason is that the inlining pass loads and deserializes all modules to be
inlined (.bc files) from disk prior to inlining them via llvm::IRMover.
There's already a cache for such modules in the code, but it is currently
unused. This is because llvm::IRMover takes the module to be inlined as
std::unique_ptr<llvm::Module>. The by-value argument requires the source
module to be moved, which means it cannot be reused afterwards. The code
accounts for that by erasing the module from the cache after inlining it,
which in turn requires reloading the module the next time a reference to
it is encountered.

Instead of loading and deserializing all modules to be inlined from disk
every time, they can reside in the cache and be cloned via
llvm::CloneModule() before they get inlined. Key to calling
llvm::CloneModule() is fully deserializing the module upfront, instead of
loading it lazily. That is why I changed the call from
LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
llvm::getOwningLazyBitcodeModule()) to LLVMParseBitcodeInContext2() (which
fully loads the module via llvm::parseBitcodeFile()). Beyond that, it
seems that prior to LLVM 13, cloning modules could fail with an assertion
(though I'm not sure whether that would cause problems in a release build
without assertions). Andres reported this problem back in the day in [1].
In the meantime the issue got discussed in [2] and was finally fixed for
LLVM 13, see [3].

Lazy jitting typically results in the generation of half as many modules
as there are functions, because the IR for tuple deforming and expression
evaluation is always created together in ExecCompileExpr(), before the
jitted program gets used for the first time. However, if e.g. no rows get
processed or a plan node is never executed, the corresponding JIT program
will never be created with the new version, and the function / module
count can be lower.

In the following you can see some performance numbers and the reproduction
steps. Everything is based on REL_14_2. The query I built is based on
Andres' initial suggestion and results in 53 functions. With the patch,
27 modules are created instead of a single one. Arbitrarily more modules
and functions can be generated by adding more UNIONs to the query below.
I also left out the first invocation in every test, because during the
first invocation the .bc files to be inlined still need to be loaded,
which is (1) slower now because the bitcode file is parsed completely and
(2) needs to be done only once as the result now gets cached.

SET jit_above_cost = 0;
SET max_parallel_workers_per_gather = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
CREATE TABLE foo (blub INT, zap INT, blarg INT);
INSERT INTO foo SELECT generate_series(1, 1000000), 1, 2;
ANALYZE foo;
EXPLAIN (ANALYZE, VERBOSE)
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg);

Baseline (1 module)
-------------------
Timing: Generation 0.432 ms, Inlining 4.705 ms, Optimization 29.662 ms,
Emission 21.300 ms, Total 56.098 ms (1 module, 9 functions)
Timing: Generation 2.042 ms, Inlining 4.014 ms, Optimization 164.745 ms,
Emission 120.334 ms, Total 291.135 ms (1 module, 59 functions)

CloneModule()
-------------
Timing: Generation 0.454 ms, Inlining 0.077 ms, Optimization 14.441 ms,
Emission 12.234 ms, Total 27.206 ms (1 module, 9 functions)
Timing: Generation 2.061 ms, Inlining 0.193 ms, Optimization 95.627 ms,
Emission 77.652 ms, Total 175.533 ms (1 module, 59 functions)

CloneModule(), lazy jitting (> 1 modules)
-----------------------------------------
Timing: Generation 0.457 ms, Inlining 0.154 ms, Optimization 15.972 ms,
Emission 13.152 ms, Total 29.735 ms ( 4 modules, 8 functions)
Timing: Generation 2.715 ms, Inlining 0.974 ms, Optimization 105.122 ms,
Emission 86.776 ms, Total 195.587 ms (27 modules, 53 functions)

CloneModule(), lazy jitting, optimization settings
--------------------------------------------------
Timing: Generation 0.758 ms, Inlining 0.153 ms, Optimization 13.137 ms,
Emission 13.805 ms, Total 27.854 ms ( 4 modules, 8 functions)
Timing: Generation 2.712 ms, Inlining 1.133 ms, Optimization 92.661 ms,
Emission 84.538 ms, Total 181.043 ms (27 modules, 53 functions)

Cloning the module instead of reloading it brings down inlining time, even
for a single module. Obviously, it pays off more the more modules are
used. Curiously, however, the time spent on optimization is also reduced
(95 ms instead of 164 ms). Could this be because some of the applied
optimizations end up in the cached module? With more modules the total
time spent on jitting increases slightly (175 ms vs 195 ms). Almost all of
that time is spent on optimization and emission; the time for inlining
only increases from ~0.2 ms to ~1.1 ms. With improved optimization
settings the final time spent is pretty much on par with the single-module
variant (175 ms vs 181 ms). So overall it looks like using many modules
this way would have competitive performance for the worst case (all
functions used), but much better performance for cases where a significant
share of the functions remains unused.

@Andres: could you provide me with the queries that caused the assertion
failure in LLVM? Have you ever observed a segfault with a
non-assert-enabled build? I just want to make sure this is truly fixed in
LLVM 13. Running 'make check-world', all tests passed.

The attached patch currently does not yet have a case distinction for LLVM
13. But that would be straightforward to add.

Thanks for your consideration.

--
David Geier
(ServiceNow)

[1] https://lists.llvm.org/pipermail/llvm-dev/2018-March/122111.html
[2] https://reviews.llvm.org/D96531
[3] https://reviews.llvm.org/rG22a52dfddcefad4f275eb8ad1cc0e200074c2d8a

On Fri, Jun 24, 2022 at 2:13 PM Luc Vlaming <luc(at)swarm64(dot)com> wrote:

> On 18-01-2021 08:47, Luc Vlaming wrote:
> > Hi everyone, Andres,
> >
> > On 03-01-2021 11:05, Luc Vlaming wrote:
> >> On 30-12-2020 14:23, Luc Vlaming wrote:
> >>> On 30-12-2020 02:57, Andres Freund wrote:
> >>>> Hi,
> >>>>
> >>>> Great to see work in this area!
> >
> > I would like this topic to somehow progress and was wondering what other
> > benchmarks / tests would be needed to have some progress? I've so far
> > provided benchmarks for small(ish) queries and some tpch numbers,
> > assuming those would be enough.
> >
> >>>>
> >>>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
> >>>>> I would like to propose a small patch to the JIT machinery which
> >>>>> makes the
> >>>>> IR code generation lazy. The reason for postponing the generation
> >>>>> of the IR
> >>>>> code is that with partitions we get an explosion in the number of JIT
> >>>>> functions generated as many child tables are involved, each with
> >>>>> their own
> >>>>> JITted functions, especially when e.g. partition-aware
> >>>>> joins/aggregates are
> >>>>> enabled. However, only a fraction of those functions is actually
> >>>>> executed
> >>>>> because the Parallel Append node distributes the workers among the
> >>>>> nodes.
> >>>>> With the attached patch we get a lazy generation which makes that
> >>>>> this is no
> >>>>> longer a problem.
> >>>>
> >>>> I unfortunately don't think this is quite good enough, because it'll
> >>>> lead to emitting all functions separately, which can also lead to very
> >>>> substantial increases of the required time (as emitting code is an
> >>>> expensive step). Obviously that is only relevant in the cases where
> the
> >>>> generated functions actually end up being used - which isn't the
> >>>> case in
> >>>> your example.
> >>>>
> >>>> If you e.g. look at a query like
> >>>> SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY
> >>>> blub;
> >>>> on a table without indexes, you would end up with functions for
> >>>>
> >>>> - WHERE clause (including deforming)
> >>>> - projection (including deforming)
> >>>> - grouping key
> >>>> - aggregate transition
> >>>> - aggregate result projection
> >>>>
> >>>> with your patch each of these would be emitted separately, instead of
> >>>> one go. Which IIRC increases the required time by a significant
> amount,
> >>>> especially if inlining is done (where each separate code generation
> >>>> ends
> >>>> up with copies of the inlined code).
> >>>>
> >>>>
> >>>> As far as I can see you've basically falsified the second part of this
> >>>> comment (which you moved):
> >>>>
> >>>>> +
> >>>>> + /*
> >>>>> + * Don't immediately emit nor actually generate the function.
> >>>>> + * instead do so the first time the expression is actually
> >>>>> evaluated.
> >>>>> + * That allows to emit a lot of functions together, avoiding a
> >>>>> lot of
> >>>>> + * repeated llvm and memory remapping overhead. It also helps
> >>>>> with not
> >>>>> + * compiling functions that will never be evaluated, as can be
> >>>>> the case
> >>>>> + * if e.g. a parallel append node is distributing workers
> >>>>> between its
> >>>>> + * child nodes.
> >>>>> + */
> >>>>
> >>>>> - /*
> >>>>> - * Don't immediately emit function, instead do so the first
> >>>>> time the
> >>>>> - * expression is actually evaluated. That allows to emit a lot
> of
> >>>>> - * functions together, avoiding a lot of repeated llvm and
> memory
> >>>>> - * remapping overhead.
> >>>>> - */
> >>>>
> >>>> Greetings,
> >>>>
> >>>> Andres Freund
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> Happy to help out, and thanks for the info and suggestions.
> >>> Also, I should have first searched psql-hackers and the like, as I
> >>> just found out there is already discussions about this in [1] and [2].
> >>> However I think the approach I took can be taken independently and
> >>> then other solutions could be added on top.
> >>>
> >>> Assuming I understood all suggestions correctly, the ideas so far are:
> >>> 1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed
> >>> and not optimized several times (see [1]). Requires all code to be
> >>> emitted in the same module.
> >>> 2. JIT only parts of the plan, based on cost (see [2]).
> >>> 3. Cache compilation results to avoid recompilation. this would
> >>> either need a shm capable optimized IR cache or would not work with
> >>> parallel workers.
> >>> 4. Lazily jitting (this patch)
> >>>
> >>> An idea that might not have been presented in the mailing list yet(?):
> >>> 5. Only JIT in nodes that process a certain amount of rows. Assuming
> >>> there is a constant overhead for JITting and the goal is to gain
> >>> runtime.
> >>>
> >>> Going forward I would first try to see if my current approach can
> >>> work out. The only idea that would be counterproductive to my
> >>> solution would be solution 1. Afterwards I'd like to continue with
> >>> either solution 2, 5, or 3 in the hopes that we can reduce JIT
> >>> overhead to a minimum and can therefore apply it more broadly.
> >>>
> >>> To test out why and where the JIT performance decreased with my
> >>> solution I improved the test script and added various queries to
> >>> model some of the cases I think we should care about. I have not
> >>> (yet) done big scale benchmarks as these queries seemed to already
> >>> show enough problems for now. Now there are 4 queries which test
> >>> JITting with/without partitions, and with varying amounts of workers
> >>> and rowcounts. I hope these are indeed a somewhat representative set
> >>> of queries.
> >>>
> >>> As pointed out the current patch does create a degradation in
> >>> performance wrt queries that are not partitioned (basically q3 and
> >>> q4). After looking into those queries I noticed two things:
> >>> - q3 is very noisy wrt JIT timings. This seems to be the result of
> >>> something wrt parallel workers starting up the JITting and creating
> >>> very high amounts of noise (e.g. inlining timings varying between
> >>> 3.8s and 6.2s)
> >>> - q4 seems very stable with JIT timings (after the first run).
> >>> I'm wondering if this could mean that with parallel workers quite a
> >>> lot of time is spent on startup of the llvm machinery and this gets
> >>> noisy because of OS interaction and the like?
> >>>
> >>> Either way I took q4 to try and fix the regression and noticed
> >>> something interesting, given the comment from Andres: the generation
> >>> and inlining time actually decreased, but the optimization and
> >>> emission time increased. After trying out various things in the
> >>> llvm_optimize_module function and googling a bit it seems that the
> >>> LLVMPassManagerBuilderUseInlinerWithThreshold adds some very
> >>> expensive passes. I tried to construct some queries where this would
> >>> actually gain us but couldnt (yet).
> >>>
> >>> For v2 of the patch-set the first patch slightly changes how we
> >>> optimize the code, which removes the aforementioned degradations in
> >>> the queries. The second patch then makes that partitions work a lot
> >>> better, but interestingly now also q4 gets a lot faster but somehow
> >>> q3 does not.
> >>>
> >>> Because these findings contradict the suggestions/findings from
> >>> Andres I'm wondering what I'm missing. I would continue and do some
> >>> TPC-H like tests on top, but apart from that I'm not entirely sure
> >>> where we are supposed to gain most from the call to
> >>> LLVMPassManagerBuilderUseInlinerWithThreshold(). Reason is that from
> >>> the scenarios I now tested it seems that the pain is actually in the
> >>> code optimization and possibly rather specific passes and not
> >>> necessarily in how many modules are emitted.
> >>>
> >>> If there are more / better queries / datasets / statistics I can run
> >>> and gather I would be glad to do so :) To me the current results seem
> >>> however fairly promising.
> >>>
> >>> Looking forward to your thoughts & suggestions.
> >>>
> >>> With regards,
> >>> Luc
> >>> Swarm64
> >>>
> >>> ===================================
> >>> Results from the test script on my machine:
> >>>
> >>> parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
> >>> query1: HEAD - 08.088901 #runs=5 #JIT=12014
> >>> query1: HEAD+01 - 06.369646 #runs=5 #JIT=12014
> >>> query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
> >>> query2: HEAD - 17.628126 #runs=5 #JIT=24074
> >>> query2: HEAD+01 - 10.786114 #runs=5 #JIT=24074
> >>> query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
> >>> query3: HEAD - 00.220141 #runs=5 #JIT=29
> >>> query3: HEAD+01 - 00.210917 #runs=5 #JIT=29
> >>> query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
> >>> query4: HEAD - 00.052305 #runs=100 #JIT=10
> >>> query4: HEAD+01 - 00.038319 #runs=100 #JIT=10
> >>> query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3
> >>>
> >>> parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
> >>> query1: HEAD - 14.922044 #runs=5 #JIT=102104
> >>> query1: HEAD+01 - 11.356347 #runs=5 #JIT=102104
> >>> query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
> >>> query2: HEAD - 18.477133 #runs=5 #JIT=40122
> >>> query2: HEAD+01 - 11.028579 #runs=5 #JIT=40122
> >>> query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
> >>> query3: HEAD - 00.235587 #runs=5 #JIT=209
> >>> query3: HEAD+01 - 00.219597 #runs=5 #JIT=209
> >>> query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
> >>> query4: HEAD - 00.052534 #runs=100 #JIT=10
> >>> query4: HEAD+01 - 00.038881 #runs=100 #JIT=10
> >>> query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3
> >>>
> >>> parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
> >>> query1: HEAD - 12.696588 #runs=5 #JIT=102104
> >>> query1: HEAD+01 - 12.279387 #runs=5 #JIT=102104
> >>> query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
> >>> query2: HEAD - 12.091824 #runs=5 #JIT=40122
> >>> query2: HEAD+01 - 11.543042 #runs=5 #JIT=40122
> >>> query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
> >>> query3: HEAD - 00.122208 #runs=5 #JIT=209
> >>> query3: HEAD+01 - 00.114153 #runs=5 #JIT=209
> >>> query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
> >>> query4: HEAD - 00.033125 #runs=100 #JIT=10
> >>> query4: HEAD+01 - 00.029818 #runs=100 #JIT=10
> >>> query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3
> >>>
> >>> parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
> >>> query1: HEAD - 02.760343 #runs=5 #JIT=102104
> >>> query1: HEAD+01 - 02.742944 #runs=5 #JIT=102104
> >>> query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
> >>> query2: HEAD - 02.396965 #runs=5 #JIT=40122
> >>> query2: HEAD+01 - 02.394724 #runs=5 #JIT=40122
> >>> query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
> >>> query3: HEAD - 00.186633 #runs=5 #JIT=209
> >>> query3: HEAD+01 - 00.189623 #runs=5 #JIT=209
> >>> query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
> >>> query4: HEAD - 00.013277 #runs=100 #JIT=10
> >>> query4: HEAD+01 - 00.012078 #runs=100 #JIT=10
> >>> query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3
> >>>
> >>> parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
> >>> query1: HEAD - 02.339973 #runs=5 #JIT=102104
> >>> query1: HEAD+01 - 02.333525 #runs=5 #JIT=102104
> >>> query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
> >>> query2: HEAD - 02.268987 #runs=5 #JIT=40122
> >>> query2: HEAD+01 - 02.248729 #runs=5 #JIT=40122
> >>> query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
> >>> query3: HEAD - 00.084531 #runs=5 #JIT=209
> >>> query3: HEAD+01 - 00.091616 #runs=5 #JIT=209
> >>> query3: HEAD+01+02 - 00.08668 #runs=5 #JIT=127
> >>> query4: HEAD - 00.005371 #runs=100 #JIT=10
> >>> query4: HEAD+01 - 00.0053 #runs=100 #JIT=10
> >>> query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3
> >>>
> >>> ===================================
> >>> [1]
> >>>
> https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com
> >>>
> >>> [2]
> >>>
> https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
> >>>
> >>
> >> Hi,
> >>
> >> Did some TPCH testing today on a TPCH 100G to see regressions there.
> >> Results (query/HEAD/patched/speedup)
> >>
> >> 1 9.49 9.25 1.03
> >> 3 11.87 11.65 1.02
> >> 4 23.74 21.24 1.12
> >> 5 11.66 11.07 1.05
> >> 6 7.82 7.72 1.01
> >> 7 12.1 11.23 1.08
> >> 8 12.99 11.2 1.16
> >> 9 71.2 68.05 1.05
> >> 10 17.72 17.31 1.02
> >> 11 4.75 4.16 1.14
> >> 12 10.47 10.27 1.02
> >> 13 38.23 38.71 0.99
> >> 14 8.69 8.5 1.02
> >> 15 12.63 12.6 1.00
> >> 19 8.56 8.37 1.02
> >> 22 10.34 9.25 1.12
> >>
> >> Cheers,
> >> Luc
> >>
> >>
> > Kind regards,
> > Luc
> >
> >
>
> Providing a new set of rebased patches with a better description in the
> hopes this helps reviewability. Also registering this to the next CF to
> increase visibility.
>
> Regards,
> Luc
>

Attachment Content-Type Size
0001-Lazily-JIT.patch text/x-patch 9.6 KB
