| From: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2026-02-20 10:01:27 |
| Message-ID: | CAN55FZ0MiFCgK26gRgE05a=_ggenkxDM8H=A2uTHpywczqt=-Q@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Sat, 14 Feb 2026 at 02:09, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> Some other random thoughts:
>
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> + match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> Since \n and \r are well below "normal" ASCII values, I wonder if we could
> simplify these to something like
>
> match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);
This didn't work because the horizontal tab character, whose value (9)
is lower than '\r' (13), would also match.
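To illustrate the tab problem, here is a minimal scalar sketch, not code from the patch. The helper names matches_range(), matches_exact() and check_tab_case() are hypothetical and only stand in for the per-lane vector comparisons discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scalar stand-in for the single "less than '\r' + 1" compare. */
static bool
matches_range(unsigned char c)
{
	return c < '\r' + 1;
}

/* Scalar stand-in for the two equality compares that are OR'ed together. */
static bool
matches_exact(unsigned char c)
{
	return c == '\n' || c == '\r';
}

/* The tab (9) is caught by the range compare but not by the exact one. */
static void
check_tab_case(void)
{
	assert(matches_range('\t'));
	assert(!matches_exact('\t'));
}
```

Both helpers agree on '\n' and '\r'; they differ on '\t' and on any other byte below 14, which is why the range form would report false matches.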
> + /* Check if we found any special characters */
> + mask = vector8_highbit_mask(match);
> + if (mask != 0)
>
> vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
> waiting until we enter the "if" block to calculate it has any benefit.
I think this makes sense, done.
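A rough scalar sketch of the idea follows; it is not the v9 code. It assumes a cheap "any lane set?" test in the spirit of vector8_is_highbit_set() guards the block, and the bit mask is computed only once a match is known. The helpers chunk_has_match(), chunk_highbit_mask() and first_eol_offset() are hypothetical and emulate a 16-byte vector with plain arrays.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE 16			/* bytes per emulated vector */

/* Cheap "is any lane set?" test, standing in for vector8_is_highbit_set(). */
static bool
chunk_has_match(const uint8_t *match)
{
	uint8_t		any = 0;

	for (int i = 0; i < CHUNK_SIZE; i++)
		any |= match[i];
	return (any & 0x80) != 0;
}

/* One bit per lane, standing in for the costlier vector8_highbit_mask(). */
static uint32_t
chunk_highbit_mask(const uint8_t *match)
{
	uint32_t	mask = 0;

	for (int i = 0; i < CHUNK_SIZE; i++)
		mask |= (uint32_t) (match[i] >> 7) << i;
	return mask;
}

/* Return the offset of the first '\n' or '\r' in a 16-byte chunk, or -1. */
static int
first_eol_offset(const uint8_t *chunk)
{
	uint8_t		match[CHUNK_SIZE];
	uint32_t	mask;
	int			offset = 0;

	for (int i = 0; i < CHUNK_SIZE; i++)
		match[i] = (chunk[i] == '\n' || chunk[i] == '\r') ? 0xFF : 0x00;

	/* Common case: nothing special in this chunk, so skip the mask. */
	if (!chunk_has_match(match))
		return -1;

	/* Only now pay for the mask, then locate the lowest set bit. */
	mask = chunk_highbit_mask(match);
	while ((mask & 1) == 0)
	{
		mask >>= 1;
		offset++;
	}
	return offset;
}
```

With this ordering, chunks without special characters never compute the mask at all.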
> + simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);
>
> If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
> place, right?
You are right, I put an assertion for this.
> + simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
> +
> + /*
> + * Do not disable SIMD when we hit EOL or EOF characters. In
> + * practice, it does not matter for EOF because parsing ends
> + * there, but we keep the behavior consistent.
> + */
> + if (!(simd_hit_eof || simd_hit_eol))
>
> I'd think that doing less unnecessary work would outweigh the benefits of
> consistency for the EOF case.
This will run only once for the data, since SIMD will be disabled
afterwards. So I think this shouldn't affect performance, but I am
okay with changing it if you prefer.
I have benchmarked v9 and didn't see any regression.
--
Regards,
Nazir Bilal Yavuz
Microsoft
| Attachment | Content-Type | Size |
|---|---|---|
| v9-0001-Speedup-COPY-FROM-with-additional-function-inlini.patch | text/x-patch | 2.6 KB |
| v9-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch | text/x-patch | 7.7 KB |