Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-02-13 11:45:30
Message-ID: CAN55FZ3g6QaiC8G4GMjdJ24egvgc-HG_xpoOztxnM_wnQNn5aw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the review!

On Thu, 12 Feb 2026 at 01:39, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Wed, Feb 11, 2026 at 04:27:50PM +0300, Nazir Bilal Yavuz wrote:
> > I am sharing a v6 which implements (1). My benchmark results show
> > almost no difference for the special-character cases and a nice
> > improvement for the no-special-character cases.
>
> Thanks!
>
> > + /* Initialize SIMD variables */
> > + cstate->simd_enabled = false;
> > + cstate->simd_initialized = false;
>
> > + /* Initialize SIMD on the first read */
> > + if (unlikely(!cstate->simd_initialized))
> > + {
> > + cstate->simd_initialized = true;
> > + cstate->simd_enabled = true;
> > + }
>
> Why do we do this initialization in CopyReadLine() as opposed to setting
> simd_enabled to true when initializing cstate in BeginCopyFrom()? If we
> can initialize it in BeginCopyFrom, we could probably remove
> simd_initialized.

Correct, I guess this is left over from the earlier versions.

> > + if (cstate->simd_enabled)
> > + result = CopyReadLineText(cstate, is_csv, true);
> > + else
> > + result = CopyReadLineText(cstate, is_csv, false);
>
> I know we discussed this upthread, but I'd like to take a closer look at
> this to see whether/why it makes such a big difference. It's a bit awkward
> that CopyReadLineText() needs to manage both its local simd_enabled and
> cstate->simd_enabled.

I extensively benchmarked this with the new v6 version. If I change
this to either of:

CopyReadLineText(cstate, is_csv);
or
CopyReadLineText(cstate, is_csv, cstate->simd_enabled);

then there is %5-%10 regression for the scalar path. I ran my
benchmarks with both "meson --buildtype=debugoptimized" and "meson
--buildtype=release" but the result is the same.

Also, if I change this code to:

if (cstate->simd_enabled)
{
if (is_csv)
result = CopyReadLineText(cstate, true, true);
else
result = CopyReadLineText(cstate, false, true);
}
else
{
if (is_csv)
result = CopyReadLineText(cstate, true, false);
else
result = CopyReadLineText(cstate, false, false);
}

then I see ~%5 performance improvement in scalar path compared to master.

> + /* Load a chunk of data into a vector register */
> + vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
>
> As mentioned upthread [0], I think it's worth testing whether processing
> multiple vectors worth of data in each loop iteration is worthwhile.
>
> [0] https://postgr.es/m/aSTVOe6BIe5f1l3i%40nathan

There are multiple keys in CopyReadLineText() compared to
pg_lfind32(). I am not sure if I correctly used multiple vectors but I
attached what I did as 0002, could you please look at it? I didn't see
any performance benefit in my benchmarks, though.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v7-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch text/x-patch 7.8 KB
v7-0002-Use-4-vectors-in-CopyReadLineText-SIMD.patch text/x-patch 6.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2026-02-13 11:46:50 Re: Row pattern recognition
Previous Message Daniel Gustafsson 2026-02-13 11:42:29 Re: Follow-up on OpenSSL "engines" and "providers"