Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc:	KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2026-02-20 10:01:27
Message-ID:	CAN55FZ0MiFCgK26gRgE05a=_ggenkxDM8H=A2uTHpywczqt=-Q@mail.gmail.com
Lists:	pgsql-hackers

Hi,

On Sat, 14 Feb 2026 at 02:09, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> Some other random thoughts:
>
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> Since \n and \r are well below "normal" ASCII values, I wonder if we could
> simplify these to something like
>
> match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);

This didn't work because we have horizontal tab characters whose value
(9) is lower than '\r' (13).
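
To illustrate the point above, here is a minimal scalar sketch (not code from the patch; both function names are hypothetical). It compares the exact end-of-line test with the proposed range test, using a scalar `<` as a stand-in for the vector comparison:

```c
#include <stdbool.h>

/* Exact test: matches only '\n' (10) and '\r' (13). */
static bool
is_eol_exact(unsigned char c)
{
	return c == '\n' || c == '\r';
}

/*
 * Range test (scalar analogue of comparing each lane against '\r' + 1):
 * matches every byte value from 0 through 13, which includes '\t' (9).
 */
static bool
is_eol_range(unsigned char c)
{
	return c < '\r' + 1;
}
```

With these definitions, is_eol_range('\t') is true while is_eol_exact('\t') is false, so the range test would flag every tab as a special character.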

> + /* Check if we found any special characters */
> + mask = vector8_highbit_mask(match);
> + if (mask != 0)
>
> vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
> waiting until we enter the "if" block to calculate it has any benefit.

I think this makes sense, done.
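
One possible reading of this suggestion, as a portable sketch rather than the actual v9 code: test cheaply whether any lane matched, and build the per-lane bitmask only inside the branch. The helpers below (any_lane_set, highbit_mask, first_special) are hypothetical scalar stand-ins for the vector operations, and CHUNK_SIZE of 16 is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 16

/* Hypothetical stand-in for a cheap "is any lane's high bit set?" test. */
static bool
any_lane_set(const uint8_t match[CHUNK_SIZE])
{
	uint8_t		acc = 0;

	for (size_t i = 0; i < CHUNK_SIZE; i++)
		acc |= match[i];
	return (acc & 0x80) != 0;
}

/* Hypothetical stand-in for the costlier per-lane bitmask extraction. */
static uint32_t
highbit_mask(const uint8_t match[CHUNK_SIZE])
{
	uint32_t	mask = 0;

	for (size_t i = 0; i < CHUNK_SIZE; i++)
		mask |= (uint32_t) (match[i] >> 7) << i;
	return mask;
}

/* Returns the index of the first '\n' or '\r' in the chunk, or -1. */
static int
first_special(const uint8_t chunk[CHUNK_SIZE])
{
	uint8_t		match[CHUNK_SIZE];
	uint32_t	mask;
	int			idx = 0;

	for (size_t i = 0; i < CHUNK_SIZE; i++)
		match[i] = (chunk[i] == '\n' || chunk[i] == '\r') ? 0xFF : 0x00;

	/* Fast path: nothing matched, so the bitmask is never computed. */
	if (!any_lane_set(match))
		return -1;

	/* Slow path: compute the bitmask only now that it is needed. */
	mask = highbit_mask(match);
	while ((mask & 1) == 0)
	{
		mask >>= 1;
		idx++;
	}
	return idx;
}
```

The common case of a chunk with no special characters then pays only for the cheap test.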

> + simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);
>
> If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
> place, right?

You are right; I added an assertion for this.
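
A rough sketch of the idea, assuming a standalone helper (the function name and its shape are hypothetical, and the standard assert() stands in for PostgreSQL's Assert() macro): once a '\r' or '\n' has been picked up, the quoted-CSV state is asserted to be impossible instead of being re-checked at run time.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: reports whether the character found by the SIMD
 * scan is an end-of-line character.  Inside a quoted CSV field the scan
 * is assumed never to report '\r' or '\n', so that combination is
 * asserted away instead of being tested in the return expression.
 */
static bool
simd_hit_eol_sketch(char c1, bool is_csv, bool in_quote)
{
	if (c1 != '\r' && c1 != '\n')
		return false;

	assert(!(is_csv && in_quote));
	return true;
}
```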

> + simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
> +
> + /*
> + * Do not disable SIMD when we hit EOL or EOF characters. In
> + * practice, it does not matter for EOF because parsing ends
> + * there, but we keep the behavior consistent.
> + */
> + if (!(simd_hit_eof || simd_hit_eol))
>
> I'd think that doing less unnecessary work would outweigh the benefits of
> consistency for the EOF case.

This code path will run only once for the data, since SIMD will be
disabled afterwards. So I think it shouldn't affect performance, but I
am okay with changing it if you prefer.

I have benchmarked v9 and didn't see any regression.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment	Content-Type	Size
v9-0001-Speedup-COPY-FROM-with-additional-function-inlini.patch	text/x-patch	2.6 KB
v9-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch	text/x-patch	7.7 KB

In response to

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2026-02-13 23:09:25 from Nathan Bossart
