Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-02-20 10:01:27
Message-ID: CAN55FZ0MiFCgK26gRgE05a=_ggenkxDM8H=A2uTHpywczqt=-Q@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Sat, 14 Feb 2026 at 02:09, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> Some other random thoughts:
>
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> Since \n and \r are well below "normal" ASCII values, I wonder if we could
> simplify these to something like
>
> match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);

This didn't work because horizontal tab characters, whose value (9) is
also lower than '\r' (13), would match as well.
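
To illustrate, here is a minimal scalar sketch of what a single lane
would compute under each approach. It is not the patch code: the
function names are hypothetical, and each function stands in for one
byte lane of the vector comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one lane of
 * vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr)).
 */
static bool
lane_matches_eq(uint8_t c)
{
    return c == '\n' || c == '\r';
}

/*
 * Hypothetical stand-in for one lane of the suggested
 * vector8_gt(<all lanes '\r' + 1>, chunk), i.e. c < '\r' + 1.
 */
static bool
lane_matches_gt(uint8_t c)
{
    return (uint8_t) ('\r' + 1) > c;
}
```

Both tests accept '\n' (10) and '\r' (13), but only the range test also
accepts '\t' (9), which is ordinary data in text format.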

> + /* Check if we found any special characters */
> + mask = vector8_highbit_mask(match);
> + if (mask != 0)
>
> vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
> waiting until we enter the "if" block to calculate it has any benefit.

I think this makes sense, done.
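
As a rough sketch of the idea (scalar emulation only, assuming a
16-byte chunk; any_lane_set(), highbit_mask() and first_match() are
hypothetical names, not the functions used in the patch), the bitmask
is built only after a cheaper "did anything match?" test succeeds:

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE 16

/* Cheap test: is the high bit set in any lane? */
static bool
any_lane_set(const uint8_t match[CHUNK_SIZE])
{
    uint8_t acc = 0;

    for (int i = 0; i < CHUNK_SIZE; i++)
        acc |= match[i];
    return (acc & 0x80) != 0;
}

/* Costlier step: collect the high bit of every lane into a bitmask. */
static uint32_t
highbit_mask(const uint8_t match[CHUNK_SIZE])
{
    uint32_t mask = 0;

    for (int i = 0; i < CHUNK_SIZE; i++)
        mask |= (uint32_t) (match[i] >> 7) << i;
    return mask;
}

/* Returns the index of the first matching lane, or -1 if there is none. */
static int
first_match(const uint8_t match[CHUNK_SIZE])
{
    uint32_t mask;
    int      idx = 0;

    if (!any_lane_set(match))
        return -1;              /* common case: mask is never computed */

    mask = highbit_mask(match); /* computed only inside the "if" path */
    while ((mask & 1) == 0)
    {
        mask >>= 1;
        idx++;
    }
    return idx;
}
```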

> + simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);
>
> If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
> place, right?

You are right; I added an assertion for this.
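
A minimal sketch of how such a check could look, using the standard
assert() as a stand-in for PostgreSQL's Assert(). The helper name
hit_eol() is hypothetical and the actual patch may be structured
differently:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: reports whether the character the SIMD scan
 * stopped on is an end-of-line character.
 */
static bool
hit_eol(char c1, bool is_csv, bool in_quote)
{
    bool is_eol = (c1 == '\r' || c1 == '\n');

    /*
     * Inside a quoted CSV field the scan should not have stopped on
     * '\r' or '\n', so the quote state no longer needs to be part of
     * the result; it is only verified here.
     */
    assert(!(is_eol && is_csv && in_quote));

    return is_eol;
}
```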

> + simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
> +
> + /*
> + * Do not disable SIMD when we hit EOL or EOF characters. In
> + * practice, it does not matter for EOF because parsing ends
> + * there, but we keep the behavior consistent.
> + */
> + if (!(simd_hit_eof || simd_hit_eol))
>
> I'd think that doing less unnecessary work would outweigh the benefits of
> consistency for the EOF case.

This check runs only once for the data, since SIMD is disabled
afterwards. So I don't think it affects performance, but I am happy to
change it if you prefer.

I have benchmarked v9 and didn't see any regression.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment                                                        Content-Type  Size
v9-0001-Speedup-COPY-FROM-with-additional-function-inlini.patch   text/x-patch  2.6 KB
v9-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch      text/x-patch  7.7 KB
