Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
To:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2025-11-19 21:01:03
Message-ID:	aR4wDwNdLc5TmcQq@nathan
Lists:	pgsql-hackers

On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
> Thanks, done.

I took a look at the v3 patches. Here are my high-level thoughts:

+ /*
+  * Parse data and transfer into line_buf. To get benefit from inlining,
+  * call CopyReadLineText() with the constant boolean variables.
+  */
+ if (cstate->simd_continue)
+     result = CopyReadLineText(cstate, is_csv, true);
+ else
+     result = CopyReadLineText(cstate, is_csv, false);

I'm curious whether this actually generates different code and, if it does,
whether it's actually faster. We're already branching on
cstate->simd_continue here.
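
For readers unfamiliar with the pattern being questioned, here is a minimal sketch of constant-argument specialization. Everything in it is hypothetical: scan_line() is a toy stand-in for CopyReadLineText(), the memchr() loop is a placeholder for the vectorized path, and the attribute spelling assumes GCC or Clang (PostgreSQL has its own pg_attribute_always_inline macro).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: GCC/Clang attribute syntax. */
#define ALWAYS_INLINE static inline __attribute__((always_inline))

/*
 * Hypothetical stand-in for CopyReadLineText(). Returns the offset of the
 * first '\n' or '\r', or len if there is none.
 */
ALWAYS_INLINE size_t
scan_line(const char *buf, size_t len, bool use_fast_path)
{
    size_t      i = 0;

    if (use_fast_path)
    {
        /* Placeholder for the SIMD loop: skip 8-byte blocks with no terminator. */
        while (i + 8 <= len &&
               memchr(buf + i, '\n', 8) == NULL &&
               memchr(buf + i, '\r', 8) == NULL)
            i += 8;
    }

    /* Scalar loop, shared by both variants. */
    for (; i < len; i++)
    {
        if (buf[i] == '\n' || buf[i] == '\r')
            break;
    }
    return i;
}

/*
 * The caller passes literal true/false so that, after inlining, the compiler
 * may emit two specialized bodies with the use_fast_path test folded away.
 */
size_t
scan_line_dispatch(const char *buf, size_t len, bool simd_continue)
{
    if (simd_continue)
        return scan_line(buf, len, true);
    else
        return scan_line(buf, len, false);
}
```

Whether the two call sites really produce different code is compiler-dependent. Comparing the assembly (for example from gcc -O2 -S) and benchmarking both forms is the only way to answer the question above.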

+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);

In other places, processing 2 or 4 vectors of data at a time has proven
faster. Have you tried that here?
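
A rough sketch of what "several vectors per iteration" could look like. To stay portable it uses a hypothetical 8-byte word as the "vector" with a standard SWAR byte-match test; the vec_* helpers and skip_clean_blocks() are illustrative names, not the simd.h API or the patch's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte "vector"; real code would use 16- or 32-byte registers. */
typedef uint64_t vec8;

/* Broadcast one byte to every lane. */
static inline vec8
vec_broadcast(uint8_t c)
{
    return UINT64_C(0x0101010101010101) * c;
}

/* Nonzero iff at least one lane of v equals the broadcast byte in pattern. */
static inline vec8
vec_has(vec8 v, vec8 pattern)
{
    vec8        x = v ^ pattern;    /* matching lanes become zero */

    return (x - UINT64_C(0x0101010101010101)) & ~x &
        UINT64_C(0x8080808080808080);
}

static inline vec8
vec_load(const char *p)
{
    vec8        v;

    memcpy(&v, p, sizeof(v));
    return v;
}

/*
 * Return the number of leading bytes that lie in 4-vector blocks containing
 * neither '\n' nor '\r'. The match results of all four vectors are OR-ed
 * together so each iteration ends in a single branch.
 */
size_t
skip_clean_blocks(const char *buf, size_t len)
{
    const vec8  nl = vec_broadcast('\n');
    const vec8  cr = vec_broadcast('\r');
    const size_t stride = 4 * sizeof(vec8);
    size_t      i = 0;

    while (i + stride <= len)
    {
        vec8        v0 = vec_load(buf + i);
        vec8        v1 = vec_load(buf + i + sizeof(vec8));
        vec8        v2 = vec_load(buf + i + 2 * sizeof(vec8));
        vec8        v3 = vec_load(buf + i + 3 * sizeof(vec8));
        vec8        hit = vec_has(v0, nl) | vec_has(v0, cr) |
            vec_has(v1, nl) | vec_has(v1, cr) |
            vec_has(v2, nl) | vec_has(v2, cr) |
            vec_has(v3, nl) | vec_has(v3, cr);

        if (hit != 0)
            break;              /* let a finer-grained path locate the byte */
        i += stride;
    }
    return i;
}
```

The trade-off is granularity: a hit anywhere in the block sends the whole block to the slower path, so the wider stride only pays off when special characters are rare.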

+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+     match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+     match = vector8_or(match, vector8_eq(chunk, quote));
+     if (escapec != '\0')
+         match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+     match = vector8_or(match, vector8_eq(chunk, bs));

The amount of branching here catches my eye. Some branching might be
unavoidable, but in general we want to keep these SIMD paths as branch-free
as possible.
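
One possible way to cut the branching, sketched under assumptions: pick the set of special bytes once, before the loop, and pad unused slots with a duplicate so the loop body always performs the same four compares. SpecialSet, choose_specials(), skip_plain_bytes(), and the 8-byte SWAR word_has() are all hypothetical names. The sketch also assumes in_quote cannot change while the loop is skipping plain bytes, which holds here only because a quote character is itself in the special set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES UINT64_C(0x0101010101010101)

/* Nonzero iff any byte of v equals c (portable stand-in for a SIMD compare). */
static inline uint64_t
word_has(uint64_t v, uint8_t c)
{
    uint64_t    x = v ^ (LANES * c);

    return (x - LANES) & ~x & UINT64_C(0x8080808080808080);
}

/* Hypothetical: the four bytes the loop searches for. */
typedef struct SpecialSet
{
    uint8_t     c[4];
} SpecialSet;

/* All mode-dependent decisions happen here, outside the hot loop. */
static SpecialSet
choose_specials(bool is_csv, bool in_quote, char quote, char escape)
{
    SpecialSet  s;

    if (is_csv)
    {
        s.c[0] = (uint8_t) quote;
        s.c[1] = (escape != '\0') ? (uint8_t) escape : (uint8_t) quote;
        /* \n and \r are not special inside quotes: reuse the quote byte */
        s.c[2] = in_quote ? (uint8_t) quote : (uint8_t) '\n';
        s.c[3] = in_quote ? (uint8_t) quote : (uint8_t) '\r';
    }
    else
    {
        s.c[0] = (uint8_t) '\\';
        s.c[1] = (uint8_t) '\\';
        s.c[2] = (uint8_t) '\n';
        s.c[3] = (uint8_t) '\r';
    }
    return s;
}

/* Count leading bytes in 8-byte words that contain no special byte. */
size_t
skip_plain_bytes(const char *buf, size_t len, SpecialSet s)
{
    size_t      i = 0;

    while (i + 8 <= len)
    {
        uint64_t    v;

        memcpy(&v, buf + i, 8);
        /* Fixed work per word: four compares, one branch. */
        if (word_has(v, s.c[0]) | word_has(v, s.c[1]) |
            word_has(v, s.c[2]) | word_has(v, s.c[3]))
            break;
        i += 8;
    }
    return i;
}
```

The duplicated slots cost redundant compares in some modes. Whether that is cheaper than the branches they replace would need to be measured.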

+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ simd_total_advance += advance;

Do we actually need to advance here? Or could we just fall through to the
scalar path? My suspicion is that this extra code doesn't gain us much.
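
To make the two options concrete, here is a small sketch of what the advance computes. It assumes one mask bit per chunk byte with bit 0 for the first byte, and at most 32 bytes per chunk; rightmost_one_pos32() is a hypothetical portable equivalent of pg_rightmost_one_pos32(), and special_mask() is a scalar stand-in for the vector compare.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable equivalent of pg_rightmost_one_pos32(); word != 0. */
static inline int
rightmost_one_pos32(uint32_t word)
{
    int         pos = 0;

    while ((word & 1) == 0)
    {
        word >>= 1;
        pos++;
    }
    return pos;
}

/* Scalar stand-in for compare + movemask: bit i set if chunk[i] is special. */
static uint32_t
special_mask(const char *chunk, size_t n)
{
    uint32_t    mask = 0;

    for (size_t i = 0; i < n && i < 32; i++)
    {
        if (chunk[i] == '\n' || chunk[i] == '\r' || chunk[i] == '\\')
            mask |= UINT32_C(1) << i;
    }
    return mask;
}

/*
 * The patch's approach: skip the clean prefix of the chunk, so the scalar
 * code starts exactly at the first special byte. Assumes n <= 32.
 */
size_t
prefix_to_skip(const char *chunk, size_t n)
{
    uint32_t    mask = special_mask(chunk, n);

    return mask ? (size_t) rightmost_one_pos32(mask) : n;
}
```

The alternative raised above is to skip nothing when the mask is nonzero and let the scalar loop rescan the chunk, which costs at most one chunk of repeated work per special character.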

+ if (simd_last_sleep_cycle == 0)
+     simd_last_sleep_cycle = 1;
+ else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
+     simd_last_sleep_cycle = SIMD_SLEEP_MAX;
+ else
+     simd_last_sleep_cycle <<= 1;
+ cstate->simd_current_sleep_cycle = simd_last_sleep_cycle;
+ cstate->simd_last_sleep_cycle = simd_last_sleep_cycle;

IMHO we should be looking for ways to simplify this should-we-use-SIMD
code. For example, perhaps we could just disable the SIMD path for 10K or
100K lines any time a special character is found. I'm dubious that a lot
of complexity is warranted.
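
A minimal sketch of the simpler heuristic described above: a fixed cooldown instead of exponential backoff. SimdGate and its functions are hypothetical names, and the 10000-line constant is just one of the two values floated here; the right value would have to come from benchmarks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 10K lines, one of the values suggested above. */
#define SIMD_COOLDOWN_LINES 10000

/* Hypothetical state; the patch keeps comparable fields in cstate. */
typedef struct SimdGate
{
    uint32_t    lines_until_retry;  /* 0 means the SIMD path is enabled */
} SimdGate;

/* Call once per line before parsing: should this line use the SIMD path? */
static inline bool
simd_gate_enabled(SimdGate *gate)
{
    if (gate->lines_until_retry == 0)
        return true;
    gate->lines_until_retry--;
    return false;
}

/* Call whenever the SIMD path runs into a special character. */
static inline void
simd_gate_trip(SimdGate *gate)
{
    gate->lines_until_retry = SIMD_COOLDOWN_LINES;
}
```

This replaces the doubling sleep cycle and its two state fields with a single counter. It gives up adaptivity: input that alternates between clean and escape-heavy regions pays the full cooldown each time.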

--
nathan
