Re: Lazy JIT IR code generation to increase JIT speed with partitions

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Lazy JIT IR code generation to increase JIT speed with partitions
Date: 2021-04-12 12:11:59
Message-ID: 1ee73e09-e3d8-22cf-3b0e-9a3f1ace0596@swarm64.com
Lists: pgsql-hackers

On 18-01-2021 08:47, Luc Vlaming wrote:
> Hi everyone, Andres,
>
> On 03-01-2021 11:05, Luc Vlaming wrote:
>> On 30-12-2020 14:23, Luc Vlaming wrote:
>>> On 30-12-2020 02:57, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> Great to see work in this area!
>
> I would like this topic to progress and was wondering what other
> benchmarks / tests would be needed to move it forward. So far I have
> provided benchmarks for small(ish) queries and some TPC-H numbers,
> assuming those would be enough.
>
>>>>
>>>> On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
>>>>> I would like to propose a small patch to the JIT machinery which
>>>>> makes the IR code generation lazy. The reason for postponing the
>>>>> generation of the IR code is that with partitions we get an
>>>>> explosion in the number of JIT functions generated, as many child
>>>>> tables are involved, each with their own JITted functions,
>>>>> especially when e.g. partition-aware joins/aggregates are enabled.
>>>>> However, only a fraction of those functions is actually executed,
>>>>> because the Parallel Append node distributes the workers among the
>>>>> nodes. With the attached patch the generation becomes lazy, so this
>>>>> is no longer a problem.
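>>>>>
>>>>> In rough C pseudo-code the idea looks like this (an illustrative
>>>>> sketch only; the names do not match the actual patch):
>>>>>
>>>>>     /* Stub installed instead of compiling right away. */
>>>>>     static Datum
>>>>>     ExecLazyJitExpr(ExprState *state, ExprContext *econtext,
>>>>>                     bool *isNull)
>>>>>     {
>>>>>         /*
>>>>>          * First evaluation: generate and emit the IR now. The
>>>>>          * (hypothetical) helper replaces state->evalfunc with the
>>>>>          * compiled function, so later calls skip this path.
>>>>>          */
>>>>>         llvm_compile_expr_now(state);
>>>>>         return state->evalfunc(state, econtext, isNull);
>>>>>     }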
>>>>
>>>> I unfortunately don't think this is quite good enough, because it'll
>>>> lead to emitting all functions separately, which can also lead to
>>>> very substantial increases in the required time (as emitting code is
>>>> an expensive step). Obviously that is only relevant in the cases
>>>> where the generated functions actually end up being used - which
>>>> isn't the case in your example.
>>>>
>>>> If you e.g. look at a query like
>>>>    SELECT blub, count(*), sum(zap) FROM foo WHERE blarg = 3
>>>>    GROUP BY blub;
>>>> on a table without indexes, you would end up with functions for
>>>>
>>>> - WHERE clause (including deforming)
>>>> - projection (including deforming)
>>>> - grouping key
>>>> - aggregate transition
>>>> - aggregate result projection
>>>>
>>>> with your patch each of these would be emitted separately, instead
>>>> of in one go, which IIRC increases the required time by a
>>>> significant amount, especially if inlining is done (where each
>>>> separate code generation ends up with copies of the inlined code).
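>>>>
>>>> For illustration, upstream currently accumulates all of these in one
>>>> module and pays the emission cost once; roughly like this (the
>>>> build_* helpers are made up, while llvm_mutable_module() and
>>>> llvm_get_function() are the real entry points):
>>>>
>>>>     LLVMModuleRef mod = llvm_mutable_module(jit_context);
>>>>
>>>>     build_qual_fn(mod);        /* WHERE clause incl. deforming */
>>>>     build_projection_fn(mod);  /* projection incl. deforming */
>>>>     build_groupkey_fn(mod);    /* grouping key */
>>>>     build_agg_trans_fn(mod);   /* aggregate transition */
>>>>     build_agg_result_fn(mod);  /* aggregate result projection */
>>>>
>>>>     /* The first lookup inlines, optimizes and emits the whole
>>>>      * module in one go. */
>>>>     void *qual_fn = llvm_get_function(jit_context, "qual_fn");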
>>>>
>>>>
>>>> As far as I can see you've basically falsified the second part of this
>>>> comment (which you moved):
>>>>
>>>>> +
>>>>> +    /*
>>>>> +     * Don't immediately emit nor actually generate the function.
>>>>> +     * instead do so the first time the expression is actually
>>>>> evaluated.
>>>>> +     * That allows to emit a lot of functions together, avoiding a
>>>>> lot of
>>>>> +     * repeated llvm and memory remapping overhead. It also helps
>>>>> with not
>>>>> +     * compiling functions that will never be evaluated, as can be
>>>>> the case
>>>>> +     * if e.g. a parallel append node is distributing workers
>>>>> between its
>>>>> +     * child nodes.
>>>>> +     */
>>>>
>>>>> -    /*
>>>>> -     * Don't immediately emit function, instead do so the first
>>>>> time the
>>>>> -     * expression is actually evaluated. That allows to emit a lot of
>>>>> -     * functions together, avoiding a lot of repeated llvm and memory
>>>>> -     * remapping overhead.
>>>>> -     */
>>>>
>>>> Greetings,
>>>>
>>>> Andres Freund
>>>>
>>>
>>> Hi,
>>>
>>> Happy to help out, and thanks for the info and suggestions.
>>> Also, I should have searched pgsql-hackers and the like first, as I
>>> just found out there are already discussions about this in [1] and
>>> [2]. However, I think the approach I took can be taken independently,
>>> and other solutions could then be added on top.
>>>
>>> Assuming I understood all suggestions correctly, the ideas so far are:
>>> 1. Add an LLVMAddMergeFunctionsPass so that duplicate code is removed
>>> and not optimized several times (see [1]). This requires all code to
>>> be emitted in the same module.
>>> 2. JIT only parts of the plan, based on cost (see [2]).
>>> 3. Cache compilation results to avoid recompilation. This would
>>> either need a shm-capable optimized IR cache or would not work with
>>> parallel workers.
>>> 4. Lazy JITting (this patch).
>>>
>>> An idea that might not have been presented on the mailing list yet(?):
>>> 5. Only JIT in nodes that process a certain number of rows, assuming
>>> there is a constant overhead for JITting and the goal is to gain
>>> runtime (a rough sketch follows below).
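>>>
>>> A rough sketch of idea 5 (not part of any patch; the threshold is
>>> purely illustrative):
>>>
>>>     /* Only JIT when the planner expects enough rows for the
>>>      * roughly constant compilation overhead to pay off. */
>>>     #define JIT_MIN_EXPECTED_ROWS 10000
>>>
>>>     static bool
>>>     jit_worthwhile(PlanState *ps)
>>>     {
>>>         return ps->plan->plan_rows >= JIT_MIN_EXPECTED_ROWS;
>>>     }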
>>>
>>> Going forward I would first try to see if my current approach can
>>> work out. The only idea that would be counterproductive to my
>>> solution is solution 1. Afterwards I'd like to continue with
>>> solution 2, 5, or 3, in the hope that we can reduce JIT overhead
>>> to a minimum and can therefore apply it more broadly.
>>>
>>> To work out why and where the JIT performance decreased with my
>>> solution I improved the test script and added various queries to
>>> model some of the cases I think we should care about. I have not
>>> (yet) done big-scale benchmarks, as these queries already seemed to
>>> show enough problems for now. There are now 4 queries which test
>>> JITting with/without partitions, and with varying numbers of workers
>>> and row counts. I hope these are indeed a somewhat representative set
>>> of queries.
>>>
>>> As pointed out, the current patch does create a performance
>>> degradation for queries that are not partitioned (basically q3 and
>>> q4). After looking into those queries I noticed two things:
>>> - q3 is very noisy wrt JIT timings. This seems to be caused by
>>> parallel workers starting up the JITting, which creates very high
>>> amounts of noise (e.g. inlining timings varying between 3.8s and
>>> 6.2s).
>>> - q4 seems very stable wrt JIT timings (after the first run).
>>> I'm wondering if this could mean that with parallel workers quite a
>>> lot of time is spent on startup of the llvm machinery, and this gets
>>> noisy because of OS interaction and the like?
>>>
>>> Either way, I took q4 to try and fix the regression and noticed
>>> something interesting, given the comment from Andres: the generation
>>> and inlining time actually decreased, but the optimization and
>>> emission time increased. After trying out various things in the
>>> llvm_optimize_module function and googling a bit, it seems that
>>> LLVMPassManagerBuilderUseInlinerWithThreshold() adds some very
>>> expensive passes (see the sketch below). I tried to construct some
>>> queries where this would actually gain us something, but couldn't
>>> (yet).
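>>>
>>> For reference, the optimization setup in llvm_optimize_module() looks
>>> roughly like this (simplified from llvmjit.c; the threshold value may
>>> differ):
>>>
>>>     LLVMPassManagerBuilderRef pmb = LLVMPassManagerBuilderCreate();
>>>     LLVMPassManagerRef mpm = LLVMCreatePassManager();
>>>
>>>     LLVMPassManagerBuilderSetOptLevel(pmb, 3);
>>>     /* Schedules the full inliner pipeline; this is where most of
>>>      * the optimization time seems to go in my tests. */
>>>     LLVMPassManagerBuilderUseInlinerWithThreshold(pmb, 512);
>>>     LLVMPassManagerBuilderPopulateModulePassManager(pmb, mpm);
>>>
>>>     LLVMRunPassManager(mpm, module);
>>>     LLVMDisposePassManager(mpm);
>>>     LLVMPassManagerBuilderDispose(pmb);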
>>>
>>> For v2 of the patch set, the first patch slightly changes how we
>>> optimize the code, which removes the aforementioned degradations in
>>> the queries. The second patch then makes partitions work a lot
>>> better; interestingly, q4 now also gets a lot faster, but somehow
>>> q3 does not.
>>>
>>> Because these findings contradict the suggestions/findings from
>>> Andres, I'm wondering what I'm missing. I would continue and do some
>>> TPC-H-like tests on top, but apart from that I'm not entirely sure
>>> where we are supposed to gain most from the call to
>>> LLVMPassManagerBuilderUseInlinerWithThreshold(). The reason is that,
>>> in the scenarios I have tested so far, the pain seems to actually be
>>> in the code optimization, and possibly in rather specific passes, not
>>> necessarily in how many modules are emitted.
>>>
>>> If there are more / better queries / datasets / statistics I can run
>>> and gather, I would be glad to do so :) To me, however, the current
>>> results seem fairly promising.
>>>
>>> Looking forward to your thoughts & suggestions.
>>>
>>> With regards,
>>> Luc
>>> Swarm64
>>>
>>> ===================================
>>> Results from the test script on my machine:
>>>
>>>   parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
>>>   query1: HEAD       - 08.088901 #runs=5 #JIT=12014
>>>   query1: HEAD+01    - 06.369646 #runs=5 #JIT=12014
>>>   query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
>>>   query2: HEAD       - 17.628126 #runs=5 #JIT=24074
>>>   query2: HEAD+01    - 10.786114 #runs=5 #JIT=24074
>>>   query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
>>>   query3: HEAD       - 00.220141 #runs=5 #JIT=29
>>>   query3: HEAD+01    - 00.210917 #runs=5 #JIT=29
>>>   query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
>>>   query4: HEAD       - 00.052305 #runs=100 #JIT=10
>>>   query4: HEAD+01    - 00.038319 #runs=100 #JIT=10
>>>   query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3
>>>
>>>   parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
>>>   query1: HEAD       - 14.922044 #runs=5 #JIT=102104
>>>   query1: HEAD+01    - 11.356347 #runs=5 #JIT=102104
>>>   query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
>>>   query2: HEAD       - 18.477133 #runs=5 #JIT=40122
>>>   query2: HEAD+01    - 11.028579 #runs=5 #JIT=40122
>>>   query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
>>>   query3: HEAD       - 00.235587 #runs=5 #JIT=209
>>>   query3: HEAD+01    - 00.219597 #runs=5 #JIT=209
>>>   query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
>>>   query4: HEAD       - 00.052534 #runs=100 #JIT=10
>>>   query4: HEAD+01    - 00.038881 #runs=100 #JIT=10
>>>   query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3
>>>
>>>   parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
>>>   query1: HEAD       - 12.696588 #runs=5 #JIT=102104
>>>   query1: HEAD+01    - 12.279387 #runs=5 #JIT=102104
>>>   query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
>>>   query2: HEAD       - 12.091824 #runs=5 #JIT=40122
>>>   query2: HEAD+01    - 11.543042 #runs=5 #JIT=40122
>>>   query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
>>>   query3: HEAD       - 00.122208 #runs=5 #JIT=209
>>>   query3: HEAD+01    - 00.114153 #runs=5 #JIT=209
>>>   query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
>>>   query4: HEAD       - 00.033125 #runs=100 #JIT=10
>>>   query4: HEAD+01    - 00.029818 #runs=100 #JIT=10
>>>   query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3
>>>
>>>   parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
>>>   query1: HEAD       - 02.760343 #runs=5 #JIT=102104
>>>   query1: HEAD+01    - 02.742944 #runs=5 #JIT=102104
>>>   query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
>>>   query2: HEAD       - 02.396965 #runs=5 #JIT=40122
>>>   query2: HEAD+01    - 02.394724 #runs=5 #JIT=40122
>>>   query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
>>>   query3: HEAD       - 00.186633 #runs=5 #JIT=209
>>>   query3: HEAD+01    - 00.189623 #runs=5 #JIT=209
>>>   query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
>>>   query4: HEAD       - 00.013277 #runs=100 #JIT=10
>>>   query4: HEAD+01    - 00.012078 #runs=100 #JIT=10
>>>   query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3
>>>
>>>   parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
>>>   query1: HEAD       - 02.339973 #runs=5 #JIT=102104
>>>   query1: HEAD+01    - 02.333525 #runs=5 #JIT=102104
>>>   query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
>>>   query2: HEAD       - 02.268987 #runs=5 #JIT=40122
>>>   query2: HEAD+01    - 02.248729 #runs=5 #JIT=40122
>>>   query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
>>>   query3: HEAD       - 00.084531 #runs=5 #JIT=209
>>>   query3: HEAD+01    - 00.091616 #runs=5 #JIT=209
>>>   query3: HEAD+01+02 - 00.08668  #runs=5 #JIT=127
>>>   query4: HEAD       - 00.005371 #runs=100 #JIT=10
>>>   query4: HEAD+01    - 00.0053   #runs=100 #JIT=10
>>>   query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3
>>>
>>> ===================================
>>> [1]
>>> https://www.postgresql.org/message-id/flat/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793%40elevated-dev.com
>>>
>>> [2]
>>> https://www.postgresql.org/message-id/flat/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com
>>>
>>
>> Hi,
>>
>> Did some TPC-H testing today on a 100 GB TPC-H data set to check for
>> regressions there. Results (query / HEAD / patched / speedup):
>>
>> 1     9.49    9.25   1.03
>> 3    11.87   11.65   1.02
>> 4    23.74   21.24   1.12
>> 5    11.66   11.07   1.05
>> 6     7.82    7.72   1.01
>> 7    12.10   11.23   1.08
>> 8    12.99   11.20   1.16
>> 9    71.20   68.05   1.05
>> 10   17.72   17.31   1.02
>> 11    4.75    4.16   1.14
>> 12   10.47   10.27   1.02
>> 13   38.23   38.71   0.99
>> 14    8.69    8.50   1.02
>> 15   12.63   12.60   1.00
>> 19    8.56    8.37   1.02
>> 22   10.34    9.25   1.12
>>
>> Cheers,
>> Luc
>>
>>
> Kind regards,
> Luc
>
>

Providing a new set of rebased patches with a better description, in the
hope that this helps reviewability. Also registering this in the next
commitfest to increase visibility.

Regards,
Luc

Attachment Content-Type Size
v3-0002-Do-not-generate-the-IR-code-ahead-of-time-but-laz.patch text/x-patch 4.9 KB
v3-0001-Improve-jitting-performance-by-not-emitting-the.patch text/x-patch 3.4 KB
