Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-03-10 12:35:30
Message-ID: CAN55FZ08kqmA+B9pzPDy-QstxAd=cK-RqjbR3cWBjPF_8-FXAw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, 9 Mar 2026 at 21:25, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Wed, Mar 04, 2026 at 06:15:53PM +0300, Nazir Bilal Yavuz wrote:
> > +#ifndef USE_NO_SIMD
> > +static bool CopyReadLineTextSIMDHelper(CopyFromState cstate, bool is_csv,
> > + bool *temp_hit_eof, int *temp_input_buf_ptr);
> > +#endif
>
> Should we inline this, too?

I think there is no need to inline this function. In the previous
version, SIMD code was in the main for loop which loops for every
character in the data. This means there was branching for every
character in the data. In the current version, SIMD code is outside of
this loop so there is no branching.

> > + /*
> > + * Do not disable SIMD when we hit EOL or EOF characters. In
> > + * practice, it does not matter for EOF because parsing ends
> > + * there, but we keep the behavior consistent.
> > + */
> > + if (!(simd_hit_eof || simd_hit_eol))
> > + cstate->simd_enabled = false;
>
> nitpick: I would personally avoid disabling it for EOF. It probably
> doesn't amount to much, but I don't see any point in the extra
> complexity/work solely for consistency.

Done. I thought that was a small change but this removed more
complexity than I thought.

>
> > + /*
> > + * We encountered a EOL or EOF on the first vector. This means
> > + * lines are not long enough to skip fully sized vector. If
> > + * this happens two times consecutively, then disable the
> > + * SIMD.
> > + */
> > + if (first_vector)
> > + {
> > + if (cstate->simd_failed_first_vector)
> > + cstate->simd_enabled = false;
> > +
> > + cstate->simd_failed_first_vector = true;
> > + }
>
> The first time I saw this, my mind immediately went to the extreme case
> where this likely regresses: alternating long and short lines. We might
> just want to disable it the first time we see a short line, like we do for
> special characters. This is another thing that we can improve
> independently later on.

I agree with you, done.

>
> > + /* First try to run SIMD, then continue with the scalar path */
> > + if (cstate->simd_enabled)
> > + {
> > + int temp_input_buf_ptr = input_buf_ptr;
> > + bool temp_hit_eof = false;
> > +
> > + result = CopyReadLineTextSIMDHelper(cstate, is_csv, &temp_hit_eof,
> > + &temp_input_buf_ptr);
> > + input_buf_ptr = temp_input_buf_ptr;
> > + hit_eof = temp_hit_eof;
>
> Given CopyReadLineTextSIMDHelper() doesn't have too much duplicated code,
> moving the SIMD stuff to its own function is nice. The temp variables seem
> a bit too magical to me, though. If those really make a difference, IMHO
> there ought to be a big comment explaining why.

I added a comment, please let me know if you wouldn't like it.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v12-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch text/x-patch 10.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2026-03-10 12:37:13 Re: Potential security risk associated with function call
Previous Message Jim Jones 2026-03-10 12:31:39 Re: [PATCH] Add CANONICAL option to xmlserialize