| From: | Pierre Ducroquet <p(dot)psql(at)pinaraf(dot)info> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: [PATCH] llvmjit: always add the simplifycfg pass |
| Date: | 2026-01-30 15:01:52 |
| Message-ID: | pJBA_YlJwSojFSBFctsdfSOfoSv2cPS9u68eH1niIUFzYj8eImTRvNCx1jaKGbBsHMM2o6plKbQZlBcoLqG7GjK0scAeuior6SkmggWrmLs=@pinaraf.info |
| Lists: | pgsql-hackers |
On Thursday, January 29, 2026 at 12:19 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
>
> > Here is a rebased version of the patch with a rewrite of the comment. Thank
> > you again for your previous review. FYI, I've tried adding other passes, but
> > none had a similar benefit-to-cost ratio. The gains would more likely come
> > from replacing O3 with an extensive list of passes.
>
>
> I agree that we should have a better list of passes. I'm a bit worried that
> having an explicit list of passes that we manage ourselves is going to be
> somewhat of a pain to maintain across llvm versions, but ...
>
> WRT passes that might be worth having even with -O0 - running duplicate
> function merging early on could be quite useful, particularly because we won't
> inline the deform routines anyway.
>
> > > I did some benchmarks on some TPCH queries (1 and 4) and I got these
> > > results. Note that for these tests I set jit_optimize_above_cost=1000000
> > > so that it forces use of the default<O0> pipeline with simplifycfg.
>
>
> FYI, you can use -1 to just disable it, instead of having to rely on a specific
> cost.
>
> > > Master Q1:
> > > Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> > > Execution Time: 38221.318 ms
> > >
> > > Patch Q1:
> > > Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> > > Execution Time: 38257.797 ms
> > >
> > > Master Q4:
> > > Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> > > Execution Time: 19512.134 ms
> > >
> > > Patch Q4:
> > > Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> > > Execution Time: 16051.483 ms
> > >
> > > For Q4 I see a small increase in the Optimization phase, but we get a good
> > > performance improvement in execution time. For Q1 the results are almost
> > > the same.
>
>
> These queries are all simple enough that I'm not sure this is a particularly
> good benchmark for optimization speed. In particular, the deform routines
> don't have to deal with a lot of columns and there aren't a lot of functions
> (although I guess that shouldn't really matter WRT simplifycfg).
>
simplifycfg does more to the deforming functions than I initially anticipated, which explains the performance benefits. I've written patches to our C code to generate better IR, but along the way I discovered quite a puzzle.

The biggest gain I see in the generated amd64 code for a very simple query (SELECT * FROM demo WHERE a = 42) with simplifycfg is that it prevents spilling to the stack: it does what mem2reg was supposed to be doing.
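For context, this is roughly how an explicit pipeline string like the one under discussion can be run through LLVM's C API with the new pass manager. A minimal sketch, not the actual patch: `optimize_module`, `mod` and `tm` are hypothetical stand-ins for the module and target machine that llvmjit.c already has at hand; `LLVMRunPasses` and `LLVMCreatePassBuilderOptions` come from llvm-c/Transforms/PassBuilder.h.

```c
#include <stdio.h>

#include <llvm-c/Core.h>
#include <llvm-c/Error.h>
#include <llvm-c/TargetMachine.h>
#include <llvm-c/Transforms/PassBuilder.h>

/* Sketch: run the O0 pipeline plus simplifycfg over an already-built module */
static void
optimize_module(LLVMModuleRef mod, LLVMTargetMachineRef tm)
{
	LLVMPassBuilderOptionsRef options = LLVMCreatePassBuilderOptions();

	/* the pipeline string being discussed: default<O0> followed by simplifycfg */
	LLVMErrorRef err = LLVMRunPasses(mod, "default<O0>,simplifycfg", tm, options);

	if (err != NULL)
	{
		char	   *msg = LLVMGetErrorMessage(err);

		fprintf(stderr, "pass pipeline failed: %s\n", msg);
		LLVMDisposeErrorMessage(msg);
	}
	LLVMDisposePassBuilderOptions(options);
}
```

The pipeline string syntax is the same one `opt -passes=` accepts, which is what makes the experiments below easy to reproduce outside the server.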
Running `opt -debug-pass-manager` on a deform function, I get:

- with `default<O0>,mem2reg`:

```
Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (56 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running analysis: TargetIRAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
	.cfi_startproc
# %bb.0:                                # %entry
	movq	24(%rdi), %rax
	movq	%rax, -48(%rsp)         # 8-byte Spill
	movq	32(%rdi), %rax
	movq	%rax, -40(%rsp)         # 8-byte Spill
	movq	%rdi, %rax
	addq	$4, %rax
	movq	%rax, -32(%rsp)         # 8-byte Spill
	movq	%rdi, %rax
	addq	$6, %rax
	movq	%rax, -24(%rsp)         # 8-byte Spill
	movq	%rdi, %rax
	addq	$72, %rax
	movq	%rax, -16(%rsp)         # 8-byte Spill
...
```
- with `default<O0>,simplifycfg`:

```
Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
	.cfi_startproc
# %bb.0:                                # %entry
	movq	24(%rdi), %rax
	movq	32(%rdi), %rsi
	movq	64(%rdi), %rcx
	movq	16(%rcx), %rcx
	movzbl	22(%rcx), %edx
	movslq	%edx, %rdx
	addq	%rdx, %rcx
	movl	72(%rdi), %edx
...
```
- with `default<O0>,simplifycfg,mem2reg`:

```
Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (46 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
	.cfi_startproc
# %bb.0:                                # %entry
	movq	24(%rdi), %rax
	movq	32(%rdi), %rsi
	movq	64(%rdi), %rcx
	movq	16(%rcx), %rcx
	movzbl	22(%rcx), %edx
	movb	$0, (%rsi)
...
```
So even when running only simplifycfg, the stack allocation goes away.
I am still trying to figure that one out, but I suspect we are no longer doing the optimizations we thought we were doing with mem2reg alone, hence the (surprising) speed gains with simplifycfg.
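For anyone wanting to reproduce the comparison, here is a sketch of the commands involved, assuming the module was dumped with the jit_dump_bitcode developer option and disassembled with llvm-dis; the file names are hypothetical, and pass names are as in LLVM 19:

```shell
# run one of the three pipelines over the dumped deform function,
# printing which passes and analyses actually execute
opt -S -debug-pass-manager -passes='default<O0>,simplifycfg' \
    deform.ll -o deform.opt.ll

# lower the optimized IR to amd64 assembly to inspect spills
llc -O0 deform.opt.ll -o deform.opt.s
```

Swapping the `-passes=` string between `default<O0>,mem2reg`, `default<O0>,simplifycfg` and `default<O0>,simplifycfg,mem2reg` reproduces the three outputs above.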
Note, the toolchain used:

```
Ubuntu LLVM version 19.1.7
Optimized build.
Default target: x86_64-pc-linux-gnu
Host CPU: znver5
```