| From: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2025-11-20 12:55:43 |
| Message-ID: | CAN55FZ0e_L_O2O5W4E39vap1rz=OJjVqT7w--7gYeHpHK0a2aQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
Thank you for looking into this!
On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
> > Thanks, done.
>
> I took a look at the v3 patches. Here are my high-level thoughts:
>
> + /*
> + * Parse data and transfer into line_buf. To get benefit from inlining,
> + * call CopyReadLineText() with the constant boolean variables.
> + */
> + if (cstate->simd_continue)
> + result = CopyReadLineText(cstate, is_csv, true);
> + else
> + result = CopyReadLineText(cstate, is_csv, false);
>
> I'm curious whether this actually generates different code, and if it does,
> if it's actually faster. We're already branching on cstate->simd_continue
> here.
I had the same doubts before but my benchmark shows nice speedup. I
used a test which is full of delimiters. The current code gives 2700
ms but when I changed these lines with the 'result =
CopyReadLineText(cstate, is_csv, cstate->simd_continue);', the result
was 2920 ms. I compiled code with both -O3 and -O2 and the results
were similar.
>
> + /* Load a chunk of data into a vector register */
> + vector8_load(&chunk, (const uint8 *) ©_input_buf[input_buf_ptr]);
>
> In other places, processing 2 or 4 vectors of data at a time has proven
> faster. Have you tried that here?
Sorry, I could not find the related code piece. I only saw the
vector8_load() inside of hex_decode_safe() function and its comment
says:
/*
* We must process 2 vectors at a time since the output will be half the
* length of the input.
*/
But this does not mention any speedup from using 2 vectors at a time.
Could you please show the related code?
>
> + /* \n and \r are not special inside quotes */
> + if (!in_quote)
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
> +
> + if (is_csv)
> + {
> + match = vector8_or(match, vector8_eq(chunk, quote));
> + if (escapec != '\0')
> + match = vector8_or(match, vector8_eq(chunk, escape));
> + }
> + else
> + match = vector8_or(match, vector8_eq(chunk, bs));
>
> The amount of branching here catches my eye. Some branching might be
> unavoidable, but in general we want to keep these SIMD paths as branch-free
> as possible.
You are right, I will check these branches and will try to remove as
many branches as possible.
>
> + /*
> + * Found a special character. Advance up to that point and let
> + * the scalar code handle it.
> + */
> + int advance = pg_rightmost_one_pos32(mask);
> +
> + input_buf_ptr += advance;
> + simd_total_advance += advance;
>
> Do we actually need to advance here? Or could we just fall through to the
> scalar path? My suspicion is that this extra code doesn't gain us much.
My testing shows that if we advance more than ~5 characters then SIMD
is worth it, but if we advance less than ~5; then code causes a
regression. I used this information while writing a heuristic.
>
> + if (simd_last_sleep_cycle == 0)
> + simd_last_sleep_cycle = 1;
> + else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
> + simd_last_sleep_cycle = SIMD_SLEEP_MAX;
> + else
> + simd_last_sleep_cycle <<= 1;
> + cstate->simd_current_sleep_cycle = simd_last_sleep_cycle;
> + cstate->simd_last_sleep_cycle = simd_last_sleep_cycle;
>
> IMHO we should be looking for ways to simplify this should-we-use-SIMD
> code. For example, perhaps we could just disable the SIMD path for 10K or
> 100K lines any time a special character is found. I'm dubious that a lot
> of complexity is warranted.
I think this is a bit too harsh since SIMD is still worth it if SIMD
can advance more than ~5 character average. I am trying to use SIMD as
much as possible when it is worth it but what you said can remove the
regression completely, perhaps that is the correct way.
--
Regards,
Nazir Bilal Yavuz
Microsoft
| From | Date | Subject | |
|---|---|---|---|
| Next Message | wenhui qiu | 2025-11-20 13:10:24 | Re: Parallel Apply |
| Previous Message | Tomas Vondra | 2025-11-20 12:53:48 | Re: failed NUMA pages inquiry status: Operation not permitted |