| From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
|---|---|
| To: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2026-03-05 21:25:14 |
| Message-ID: | 91acb778-42c4-44ef-8888-f18ad9b12a5b@dunslane.net |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 2026-03-04 We 10:15 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Mon, 2 Mar 2026 at 22:55, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>> On Wed, Feb 25, 2026 at 05:24:27PM +0300, Nazir Bilal Yavuz wrote:
>>> If anyone has any suggestions/ideas, please let me know!
> I am able to fix the problem. My first assumption was that the
> branching of SIMD code caused that problem, so I moved SIMD code to
> the CopyReadLineTextSIMDHelper() function. Then I moved this
> CopyReadLineTextSIMDHelper() to top of CopyReadLineText(), by doing
> that we won't have any branching in the non-SIMD (scalar) code path.
> This didn't solve the problem and then I realized that even though I
> disable SIMD code path with 'if (false)', there is still regression
> but if I comment all of the 'if (cstate->simd_enabled)' branch, then
> there is no regression at all.
>
> To find out more, I compared assembly outputs of both and found out
> the possible reason. What I understood is that the compiler can't
> promote a variable to register, instead these variables live in the
> stack; which is slower. Please see the two different assembly outputs:
>
> Slow code:
>
> c = copy_input_buf[input_buf_ptr++];
> db0: 48 8b 55 b8 mov -0x48(%rbp),%rdx
> db4: 48 63 c6 movslq %esi,%rax
> db7: 44 8d 66 01 lea 0x1(%rsi),%r12d
> dbb: 44 89 65 cc mov %r12d,-0x34(%rbp)
> dbf: 0f be 14 02 movsbl (%rdx,%rax,1),%edx
>
> Fast code:
>
> c = copy_input_buf[input_buf_ptr++];
> d80: 49 63 c4 movslq %r12d,%rax
> d83: 45 8d 5c 24 01 lea 0x1(%r12),%r11d
> d88: 41 0f be 04 06 movsbl (%r14,%rax,1),%eax
>
> And the reason for that is sending the address of input_buf_ptr to a
> CopyReadLineTextSIMDHelper(..., &input_buf_ptr). If I change it to
> this:
>
> int temp_input_buf_ptr = input_buf_ptr;
> CopyReadLineTextSIMDHelper(..., &temp_input_buf_ptr);
>
> Then there is no regression. However, I am still not completely sure
> if that is the same problem in the v10, I am planning to spend more
> time debugging this.
>
>> A couple of random ideas:
>>
>> * Additional inlining for callers. I looked around a little bit and didn't
>> see any great candidates, so I don't have much faith in this, but maybe
>> you'll see something I don't.
> I agree with you. CopyReadLineText() is already quite a big function.
>
>> * Disable SIMD if we are consistently getting small rows. That won't help
>> your "wide & CSV 1/3" case in all likelihood, but perhaps it'll help with
>> the regression for narrow rows described elsewhere.
> I implemented this, two consecutive small rows disables SIMD.
>
>> * Surround the variable initializations with "if (simd_enabled)".
>> Presumably compilers are smart enough to remove those in the non-SIMD paths
>> already, but it could be worth a try.
> Done.
>
>> * Add simd_enabled function parameter to CopyReadLine(),
>> NextCopyFromRawFieldsInternal(), and CopyFromTextLikeOneRow(), and do the
>> bool literal trick in CopyFrom{Text,CSV}OneRow(). That could encourage the
>> compiler to do some additional optimizations to reduce branching.
> I think we don't need this. At least the implementation with
> CopyReadLineTextSIMDHelper() doesn't need this since branching will be
> at the top and it will be once per line.
>
> I think v11 looks better compared to v10. I liked the
> CopyReadLineTextSIMDHelper() helper function. I also liked it being at
> the top of CopyReadLineText(), not being in the scalar path. This
> gives us more optimization options without affecting the scalar path.
>
> Here are the new benchmark results, I benchmarked the changes with
> both -O2 and -O3 and also both with and without 'changing
> default_toast_compression to lz4' commit (65def42b1d5). Benchmark
> results show that there is no regression and the performance
> improvement is much bigger with 65def42b1d5, it is close to 2x for
> text format and more than 2x for the csv format.
I spent some time exploring different ideas for improving this, but
found none that didn't cause regression in some cases, so good to go
from my POV.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Corey Huinker | 2026-03-05 21:40:56 | Re: Add starelid, attnum to pg_stats and leverage this in pg_dump |
| Previous Message | Andrew Jackson | 2026-03-05 21:15:26 | Re: Add Option To Check All Addresses For Matching target_session_attr |