| From: | Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> |
|---|---|
| To: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz> |
| Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2025-12-06 01:39:56 |
| Message-ID: | CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Nov 26, 2025 at 8:21 AM Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>
wrote:
>
>
> On Wed, Nov 26, 2025 at 5:51 AM KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:
>
>> Hello,
>> On Wed, Nov 19, 2025 at 10:01 PM Nathan Bossart <nathandbossart(at)gmail(dot)com>
>> wrote:
>>
>>> On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
>>> > Thanks, done.
>>>
>>> I took a look at the v3 patches. Here are my high-level thoughts:
>>>
>>> + /*
>>> + * Parse data and transfer into line_buf. To get benefit from inlining,
>>> + * call CopyReadLineText() with the constant boolean variables.
>>> + */
>>> + if (cstate->simd_continue)
>>> + result = CopyReadLineText(cstate, is_csv, true);
>>> + else
>>> + result = CopyReadLineText(cstate, is_csv, false);
>>>
>>> I'm curious whether this actually generates different code, and if it
>>> does, if it's actually faster. We're already branching on
>>> cstate->simd_continue here.
>>
>> I've compiled both versions with -O2 and confirmed they generate
>> different code. When simd_continue is passed as a constant to
>> CopyReadLineText, the compiler optimizes out the condition checks from the
>> SIMD path.
>> A small benchmark on a 1GB+ file shows the expected benefit: roughly a
>> 6% performance improvement.
>> I've attached the assembly outputs in case someone wants to check
>> something else.
>>
>>
>> Regards,
>> Ayoub Kazar
>>
>
> Correction to my last post:
>
> I also tried files that alternated lines with no special characters and
> lines with 1/3rd special characters, thinking I could force the algorithm
> to continually check whether or not it should use simd and therefore force
> more overhead in the try-simd/don't-try-simd housekeeping code. The text
> file was still 20% faster (not 50% faster as I originally stated --- that
> was a typo). The CSV file was still 13% faster.
>
> Also, apologies for posting at the top in my last e-mail.
> --
> -- Manni Wood EDB: https://www.enterprisedb.com
>
Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the copy
file to determine if SIMD is worth using. Attached are v4 versions of the
patches showing a first attempt at doing that.
I attached test.sh.txt to show how I've been testing, with 5 million lines
of the various copy file variations introduced by Ayoub Kazar.
The text copy with no special chars is 30% faster. The CSV copy with no
special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The
CSV with 1/3rd quotes is 0.27% slower.
This set of patches follows the simplest suggestion of just testing the
first N lines (actually first N bytes) of the file and then deciding
whether or not to enable SIMD. This set of patches does not follow Andrew's
later suggestion of maybe checking again every million lines or so.
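For readers skimming the thread, the "sample the first N bytes, then decide" idea can be sketched roughly as below. This is a minimal illustration only, not code from the v4 patches: the function names, the SAMPLE_BYTES and MAX_SPECIAL_PERCENT values, and the choice of which characters count as "special" are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs; neither value comes from the v4 patches. */
#define SAMPLE_BYTES        65536
#define MAX_SPECIAL_PERCENT 5

/*
 * Assumption: in text mode only backslash counts as "special"; in CSV
 * mode the quote and escape characters do. Newlines are not counted,
 * since every line has one.
 */
static bool
is_special(char c, bool is_csv, char quote, char escape)
{
    if (is_csv)
        return c == quote || c == escape;
    return c == '\\';
}

/*
 * Inspect at most SAMPLE_BYTES of the first chunk and report whether
 * special characters are rare enough that a SIMD scan is likely to pay off.
 */
bool
sample_says_use_simd(const char *buf, size_t len, bool is_csv)
{
    size_t n = len < SAMPLE_BYTES ? len : SAMPLE_BYTES;
    size_t special = 0;

    if (n == 0)
        return false;           /* nothing to sample; stay on the scalar path */

    for (size_t i = 0; i < n; i++)
    {
        if (is_special(buf[i], is_csv, '"', '"'))
            special++;
    }

    /* Integer comparison equivalent to special / n <= MAX_SPECIAL_PERCENT / 100. */
    return special * 100 <= n * MAX_SPECIAL_PERCENT;
}
```

The decision is made once, from the first chunk, and never revisited, which is what makes this the simplest variant: a file whose character mix changes after the sample would keep whichever path was chosen up front.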
--
-- Manni Wood EDB: https://www.enterprisedb.com
| Attachment | Content-Type | Size |
|---|---|---|
| test.sh.txt | text/plain | 2.6 KB |
| v4-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch | text/x-patch | 5.0 KB |
| v4-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch | text/x-patch | 3.7 KB |