Re: Lazy JIT IR code generation to increase JIT speed with partitions

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Lazy JIT IR code generation to increase JIT speed with partitions
Date: 2021-01-18 07:47:56
Message-ID: 33147b42-3386-6c71-cfca-1e997e6b73f5@swarm64.com
Lists: pgsql-hackers

Hi everyone, Andres,

On 03-01-2021 11:05, Luc Vlaming wrote:
> On 30-12-2020 14:23, Luc Vlaming wrote:
>> On 30-12-2020 02:57, Andres Freund wrote:
>>> Hi,
>>>
>>> Great to see work in this area!

I would like to move this topic forward and was wondering what other
benchmarks / tests are needed to make progress. So far I have provided
benchmarks for small(ish) queries and some TPC-H numbers, assuming those
would be enough.

>>>
>>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
>>>> I would like to propose a small patch to the JIT machinery which
>>>> makes the IR code generation lazy. The reason for postponing the
>>>> generation of the IR code is that with partitions we get an
>>>> explosion in the number of JIT functions generated, as many child
>>>> tables are involved, each with their own JITted functions,
>>>> especially when e.g. partition-aware joins/aggregates are enabled.
>>>> However, only a fraction of those functions is actually executed,
>>>> because the Parallel Append node distributes the workers among the
>>>> nodes. With the attached patch the generation becomes lazy, which
>>>> eliminates this problem.
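
(To make the mechanism concrete: the patch defers both IR generation and
emission to the first evaluation of the expression, roughly along these
lines. This is a simplified sketch with illustrative names, not the
literal patch:)

  #include "postgres.h"
  #include "nodes/execnodes.h"

  /* sketch-only bookkeeping for one deferred expression */
  typedef struct LazyJitContext
  {
      bool    generated;      /* IR generated and emitted yet? */
      /* ... whatever is needed to build this expression's IR ... */
  } LazyJitContext;

  /*
   * Installed as ExprState->evalfunc: on the first call, generate the
   * IR (batched with other deferred expressions), emit the module, and
   * swap in the native function pointer; later calls never come here.
   */
  static Datum
  ExecRunLazilyCompiledExpr(ExprState *state, ExprContext *econtext,
                            bool *isNull)
  {
      LazyJitContext *ctx = (LazyJitContext *) state->evalfunc_private;

      if (!ctx->generated)
      {
          /* hypothetical helper: builds, optimizes and emits the IR,
           * then points state->evalfunc at the compiled function */
          llvm_generate_and_emit(state, ctx);
          ctx->generated = true;
      }
      return state->evalfunc(state, econtext, isNull);
  }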
>>>
>>> I unfortunately don't think this is quite good enough, because it'll
>>> lead to emitting all functions separately, which can also lead to
>>> very substantial increases in the required time (as emitting code is
>>> an expensive step). Obviously that is only relevant in the cases
>>> where the generated functions actually end up being used - which
>>> isn't the case in your example.
>>>
>>> If you e.g. look at a query like
>>>   SELECT blub, count(*), sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;
>>> on a table without indexes, you would end up with functions for
>>> on a table without indexes, you would end up with functions for
>>>
>>> - WHERE clause (including deforming)
>>> - projection (including deforming)
>>> - grouping key
>>> - aggregate transition
>>> - aggregate result projection
>>>
>>> with your patch each of these would be emitted separately, instead
>>> of in one go, which IIRC increases the required time by a
>>> significant amount, especially if inlining is done (where each
>>> separate code generation ends up with copies of the inlined code).
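
(To restate my understanding of that concern: the benefit of deferred
emission is that all functions of a query land in one module and get a
single optimize/emit pass, along these lines. The generator helpers are
hypothetical, not actual PostgreSQL code:)

  #include <llvm-c/Core.h>

  /* hypothetical per-expression generators, one per function above */
  extern void emit_qual_func(LLVMModuleRef mod);
  extern void emit_projection_func(LLVMModuleRef mod);
  extern void emit_grouping_key_func(LLVMModuleRef mod);
  extern void emit_agg_trans_func(LLVMModuleRef mod);
  extern void emit_agg_result_func(LLVMModuleRef mod);
  extern void optimize_and_emit(LLVMModuleRef mod);  /* hypothetical */

  static void
  build_query_module(void)
  {
      /* one module for the whole query ... */
      LLVMModuleRef mod = LLVMModuleCreateWithName("query");

      /* ... into which all five functions are generated first ... */
      emit_qual_func(mod);
      emit_projection_func(mod);
      emit_grouping_key_func(mod);
      emit_agg_trans_func(mod);
      emit_agg_result_func(mod);

      /* ... so inlining/optimization/emission runs once, not five times */
      optimize_and_emit(mod);
  }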
>>>
>>>
>>> As far as I can see you've basically falsified the second part of this
>>> comment (which you moved):
>>>
>>>> +
>>>> +    /*
>>>> +     * Don't immediately emit nor actually generate the function.
>>>> +     * instead do so the first time the expression is actually
>>>> evaluated.
>>>> +     * That allows to emit a lot of functions together, avoiding a
>>>> lot of
>>>> +     * repeated llvm and memory remapping overhead. It also helps
>>>> with not
>>>> +     * compiling functions that will never be evaluated, as can be
>>>> the case
>>>> +     * if e.g. a parallel append node is distributing workers
>>>> between its
>>>> +     * child nodes.
>>>> +     */
>>>
>>>> -    /*
>>>> -     * Don't immediately emit function, instead do so the first
>>>> time the
>>>> -     * expression is actually evaluated. That allows to emit a lot of
>>>> -     * functions together, avoiding a lot of repeated llvm and memory
>>>> -     * remapping overhead.
>>>> -     */
>>>
>>> Greetings,
>>>
>>> Andres Freund
>>>
>>
>> Hi,
>>
>> Happy to help out, and thanks for the info and suggestions.
>> Also, I should have first searched pgsql-hackers and the like, as I
>> just found out there are already discussions about this in [1] and
>> [2]. However, I think the approach I took can be pursued
>> independently, and the other solutions could then be added on top.
>>
>> Assuming I understood all suggestions correctly, the ideas so far are:
>> 1. Add an LLVMAddMergeFunctionsPass so that duplicate code is removed
>> and not optimized several times (see [1] and the sketch after this
>> list). Requires all code to be emitted in the same module.
>> 2. JIT only parts of the plan, based on cost (see [2]).
>> 3. Cache compilation results to avoid recompilation. This would
>> either need a shm-capable optimized IR cache or would not work with
>> parallel workers.
>> 4. Lazy JITting (this patch).
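
(Sketch for idea 1. LLVMAddMergeFunctionsPass is the LLVM C API pass;
the surrounding helper is illustrative:)

  #include <llvm-c/Core.h>
  #include <llvm-c/Transforms/IPO.h>

  /*
   * Run the function-merging pass over a module so that identical
   * JITted functions (e.g. the same qual generated for every partition)
   * are deduplicated before the expensive optimization passes run.
   * Only effective if all functions live in the same module.
   */
  static void
  merge_duplicate_functions(LLVMModuleRef module)
  {
      LLVMPassManagerRef mpm = LLVMCreatePassManager();

      LLVMAddMergeFunctionsPass(mpm);
      LLVMRunPassManager(mpm, module);
      LLVMDisposePassManager(mpm);
  }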
>>
>> An idea that might not have been presented on the mailing list yet(?):
>> 5. Only JIT nodes that process a certain number of rows, assuming
>> there is a constant overhead for JITting and the goal is to win back
>> runtime (see the sketch below).
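
(A minimal sketch of idea 5; jit_min_rows is a hypothetical GUC, not an
existing setting:)

  #include "postgres.h"
  #include "nodes/plannodes.h"

  /* hypothetical GUC: minimum estimated rows before a node is JITted */
  static double jit_min_rows = 10000;

  /*
   * Decide per plan node whether JIT is worthwhile, based on the
   * planner's row estimate, so the constant JIT overhead is only paid
   * where enough rows flow through to win it back.
   */
  static bool
  jit_worthwhile_for_node(Plan *plan)
  {
      return plan->plan_rows >= jit_min_rows;
  }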
>>
>> Going forward I would first try to see if my current approach can
>> work out. The only idea that would be counterproductive to my
>> solution is solution 1. Afterwards I'd like to continue with solution
>> 2, 5, or 3, in the hope that we can reduce the JIT overhead to a
>> minimum and can therefore apply it more broadly.
>>
>> To find out why and where the JIT performance decreased with my
>> solution I improved the test script and added various queries to
>> model some of the cases I think we should care about. I have not
>> (yet) done big-scale benchmarks, as these queries already showed
>> enough problems for now. There are now 4 queries which test JITting
>> with/without partitions, and with varying numbers of workers and row
>> counts. I hope these are indeed a somewhat representative set of
>> queries.
>>
>> As pointed out, the current patch does create a performance
>> degradation for queries that are not partitioned (basically q3 and
>> q4). After looking into those queries I noticed two things:
>> - q3 is very noisy wrt JIT timings. This seems to be caused by
>> parallel workers starting up the JITting, which creates a lot of
>> noise (e.g. inlining timings varying between 3.8s and 6.2s).
>> - q4 seems very stable wrt JIT timings (after the first run).
>> I'm wondering if this could mean that with parallel workers quite a
>> lot of time is spent on startup of the llvm machinery, and that this
>> gets noisy because of OS interaction and the like?
>>
>> Either way, I took q4 to try and fix the regression and noticed
>> something interesting, given the comment from Andres: the generation
>> and inlining time actually decreased, but the optimization and
>> emission time increased. After trying out various things in the
>> llvm_optimize_module function and googling a bit, it seems that
>> LLVMPassManagerBuilderUseInlinerWithThreshold() adds some very
>> expensive passes. I tried to construct some queries where this would
>> actually gain us something, but couldn't (yet).
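
(For reference, the relevant part of llvm_optimize_module() looks
roughly like this; simplified from src/backend/jit/llvm/llvmjit.c, so
details may differ:)

  llvm_pmb = LLVMPassManagerBuilderCreate();
  LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
  llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);

  if (context->base.flags & PGJIT_OPT3)
  {
      /* the call that pulls in the expensive inlining passes */
      LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
  }

  LLVMPassManagerBuilderPopulateFunctionPassManager(llvm_pmb, llvm_fpm);
  /* ... then run llvm_fpm over each function, and populate and run the
   * module pass manager ... */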
>>
>> For v2 of the patch set, the first patch slightly changes how we
>> optimize the code, which removes the aforementioned degradations in
>> the queries. The second patch then makes partitions work a lot
>> better; interestingly, q4 now also gets a lot faster, but somehow q3
>> does not.
>>
>> Because these findings contradict the suggestions/findings from
>> Andres, I'm wondering what I'm missing. I would continue and do some
>> TPC-H-like tests on top, but apart from that I'm not entirely sure
>> where we are supposed to gain most from the call to
>> LLVMPassManagerBuilderUseInlinerWithThreshold(). The reason is that
>> in the scenarios I have now tested, the pain is actually in the code
>> optimization, and possibly in rather specific passes, not necessarily
>> in how many modules are emitted.
>>
>> If there are more / better queries / datasets / statistics I can run
>> and gather, I would be glad to do so :) To me, however, the current
>> results seem fairly promising.
>>
>> Looking forward to your thoughts & suggestions.
>>
>> With regards,
>> Luc
>> Swarm64
>>
>> ===================================
>> Results from the test script on my machine:
>>
>>   parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
>>   query1: HEAD       - 08.088901 #runs=5 #JIT=12014
>>   query1: HEAD+01    - 06.369646 #runs=5 #JIT=12014
>>   query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
>>   query2: HEAD       - 17.628126 #runs=5 #JIT=24074
>>   query2: HEAD+01    - 10.786114 #runs=5 #JIT=24074
>>   query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
>>   query3: HEAD       - 00.220141 #runs=5 #JIT=29
>>   query3: HEAD+01    - 00.210917 #runs=5 #JIT=29
>>   query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
>>   query4: HEAD       - 00.052305 #runs=100 #JIT=10
>>   query4: HEAD+01    - 00.038319 #runs=100 #JIT=10
>>   query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3
>>
>>   parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
>>   query1: HEAD       - 14.922044 #runs=5 #JIT=102104
>>   query1: HEAD+01    - 11.356347 #runs=5 #JIT=102104
>>   query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
>>   query2: HEAD       - 18.477133 #runs=5 #JIT=40122
>>   query2: HEAD+01    - 11.028579 #runs=5 #JIT=40122
>>   query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
>>   query3: HEAD       - 00.235587 #runs=5 #JIT=209
>>   query3: HEAD+01    - 00.219597 #runs=5 #JIT=209
>>   query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
>>   query4: HEAD       - 00.052534 #runs=100 #JIT=10
>>   query4: HEAD+01    - 00.038881 #runs=100 #JIT=10
>>   query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3
>>
>>   parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
>>   query1: HEAD       - 12.696588 #runs=5 #JIT=102104
>>   query1: HEAD+01    - 12.279387 #runs=5 #JIT=102104
>>   query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
>>   query2: HEAD       - 12.091824 #runs=5 #JIT=40122
>>   query2: HEAD+01    - 11.543042 #runs=5 #JIT=40122
>>   query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
>>   query3: HEAD       - 00.122208 #runs=5 #JIT=209
>>   query3: HEAD+01    - 00.114153 #runs=5 #JIT=209
>>   query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
>>   query4: HEAD       - 00.033125 #runs=100 #JIT=10
>>   query4: HEAD+01    - 00.029818 #runs=100 #JIT=10
>>   query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3
>>
>>   parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
>>   query1: HEAD       - 02.760343 #runs=5 #JIT=102104
>>   query1: HEAD+01    - 02.742944 #runs=5 #JIT=102104
>>   query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
>>   query2: HEAD       - 02.396965 #runs=5 #JIT=40122
>>   query2: HEAD+01    - 02.394724 #runs=5 #JIT=40122
>>   query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
>>   query3: HEAD       - 00.186633 #runs=5 #JIT=209
>>   query3: HEAD+01    - 00.189623 #runs=5 #JIT=209
>>   query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
>>   query4: HEAD       - 00.013277 #runs=100 #JIT=10
>>   query4: HEAD+01    - 00.012078 #runs=100 #JIT=10
>>   query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3
>>
>>   parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
>>   query1: HEAD       - 02.339973 #runs=5 #JIT=102104
>>   query1: HEAD+01    - 02.333525 #runs=5 #JIT=102104
>>   query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
>>   query2: HEAD       - 02.268987 #runs=5 #JIT=40122
>>   query2: HEAD+01    - 02.248729 #runs=5 #JIT=40122
>>   query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
>>   query3: HEAD       - 00.084531 #runs=5 #JIT=209
>>   query3: HEAD+01    - 00.091616 #runs=5 #JIT=209
>>   query3: HEAD+01+02 - 00.08668  #runs=5 #JIT=127
>>   query4: HEAD       - 00.005371 #runs=100 #JIT=10
>>   query4: HEAD+01    - 00.0053   #runs=100 #JIT=10
>>   query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3
>>
>> ===================================
>> [1]
>> https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com
>>
>> [2]
>> https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
>>
>
> Hi,
>
> Did some TPC-H testing today on a 100GB (TPC-H 100G) data set to check
> for regressions there. Results:
>
>   query    HEAD  patched  speedup
>       1    9.49     9.25     1.03
>       3   11.87    11.65     1.02
>       4   23.74    21.24     1.12
>       5   11.66    11.07     1.05
>       6    7.82     7.72     1.01
>       7   12.1     11.23     1.08
>       8   12.99    11.2      1.16
>       9   71.2     68.05     1.05
>      10   17.72    17.31     1.02
>      11    4.75     4.16     1.14
>      12   10.47    10.27     1.02
>      13   38.23    38.71     0.99
>      14    8.69     8.5      1.02
>      15   12.63    12.6      1.00
>      19    8.56     8.37     1.02
>      22   10.34     9.25     1.12
>
> Cheers,
> Luc
>
>
Kind regards,
Luc
