Re: Speed up COPY FROM text/CSV parsing using SIMD

From: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
To: Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2025-08-14 02:24:50
Message-ID: CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com
Lists: pgsql-hackers

Following Nazir's finding that 4096 bytes is the best-performing line
length, I ran more benchmarks on my side for both TEXT and CSV formats,
covering two cases: normal data (no special characters) and data with
many special characters.

Results for normal data are as good as expected and similar to previous
benchmarks:
~30.9% faster copy in TEXT format
~32.4% faster copy in CSV format
20%-30% fewer cycles per instruction

When lines contain many special characters (e.g., tables with a large
number of columns, maybe), we obviously expect regressions because of the
overhead of many fallbacks to scalar processing.
Results when special characters make up 1/3 of the line length:
~43.9% slower copy in TEXT format
~16.7% slower copy in CSV format
So even with fewer special characters, or with wider spacing between
them, there might still be some regressions. This may be an insignificant
case, but it can be handled in other patches if we consider not using the
SIMD path in some situations.
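
To make the fallback overhead concrete, here is a minimal sketch, not the
patch itself. It assumes a 16-byte chunk and uses a scalar loop as a
stand-in for the vector compare; chunk_special_mask, lowest_set_bit and
count_scalar_fallbacks are hypothetical names used only for illustration.
It counts how often a scan has to leave the chunked path:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16				/* assumed vector width in bytes */

/*
 * Hypothetical stand-in for a vector compare + movemask: bit i is set
 * when byte i of the chunk is a special character.
 */
static uint32_t
chunk_special_mask(const char *p, char delim)
{
	uint32_t	mask = 0;

	for (int i = 0; i < CHUNK; i++)
	{
		char		c = p[i];

		if (c == delim || c == '\\' || c == '\n' || c == '\r')
			mask |= (uint32_t) 1 << i;
	}
	return mask;
}

/* Index of the lowest set bit; mask must be nonzero. */
static int
lowest_set_bit(uint32_t mask)
{
	int			pos = 0;

	while ((mask & 1) == 0)
	{
		mask >>= 1;
		pos++;
	}
	return pos;
}

/* Counts how many times the scan drops from the chunked path to scalar. */
static size_t
count_scalar_fallbacks(const char *buf, size_t len, char delim)
{
	size_t		pos = 0;
	size_t		fallbacks = 0;

	while (pos + CHUNK <= len)
	{
		uint32_t	mask = chunk_special_mask(buf + pos, delim);

		if (mask == 0)
		{
			pos += CHUNK;		/* clean chunk: skip it whole */
			continue;
		}
		/* Jump to the first special byte and consume it in scalar code. */
		pos += (size_t) lowest_set_bit(mask) + 1;
		fallbacks++;
	}
	return fallbacks;
}
```

With a special character in every third byte, each chunk is inspected and
then abandoned after only three bytes of progress, so the chunk check is
paid on top of the scalar work instead of replacing it.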

I hope this helps more and confirms the patch.

Regards,
Ayoub Kazar

On Thu, Aug 14, 2025 at 01:55, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
wrote:

> On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
> wrote:
>
> > > + * However, SIMD optimization cannot be applied in the following cases:
> > > + * - Inside quoted fields, where escape sequences and closing quotes
> > > + * require sequential processing to handle correctly.
> > >
> > > I think you can continue SIMD inside quoted fields. Only important
> > > thing is you need to set last_was_esc to false when SIMD skipped the
> > > chunk.
> >
> > That's a clever point that last_was_esc should be reset to false when
> > a SIMD chunk is skipped. You're right about that specific case.
> >
> > However, the core challenge is not what happens when we skip a chunk,
> > but what happens when a chunk contains special characters like quotes
> > or escapes. The main reason we avoid SIMD inside quoted fields is that
> > the parsing logic becomes fundamentally sequential and
> > context-dependent.
> >
> > To correctly parse a "" as a single literal quote, we must perform a
> > lookahead to check the next character. This is an inherently
> > sequential operation that doesn't map well to SIMD's parallel nature.
> >
> > Trying to handle this stateful logic with SIMD would lead to
> > significant implementation complexity, especially with edge cases like
> > an escape character falling on the last byte of a chunk.
>
> Ah, you're right. My apologies, I misunderstood the implementation. It
> appears that SIMD can be used even within quoted strings.
>
> I think it would be better not to use the SIMD path when last_was_esc
> is true. The next character is likely to be a special character, and
> handling this case outside the SIMD loop would also improve
> readability by consolidating the last_was_esc toggle logic in one
> place.
>
> Furthermore, when inside a quote (in_quote) in CSV mode, the detection
> of \n and \r can be disabled.
>
> + last_was_esc = false;
>
> Regarding the implementation, I believe we must set last_was_esc to
> false when advancing input_buf_ptr, as shown in the code below. For
> this reason, I think it’s best to keep the current logic for toggling
> last_was_esc.
>
> + int advance = pg_rightmost_one_pos32(mask);
> + input_buf_ptr += advance;
>
> I've attached a new patch that includes these changes. Further
> modifications are still in progress.
>
> --
> Best regards,
> Shinya Kato
> NTT OSS Center
>
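
For readers following the quoted discussion, the CSV points above (skip
the chunked path while last_was_esc is true, and ignore \n and \r while
in_quote) can be sketched roughly as below. This is only an illustration
under assumptions, not the attached patch: the 16-byte chunk, the scalar
stand-in for the vector compare, and the names chunk_needs_scalar and
find_line_end are all hypothetical. For brevity only '\n' is treated as a
line terminator in the scalar step.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK 16				/* assumed vector width in bytes */

/*
 * Scalar stand-in for a vector compare: true if the chunk holds any byte
 * the scanner must inspect. Newlines are not looked for inside quotes.
 */
static bool
chunk_needs_scalar(const char *p, char quotec, char escapec, bool in_quote)
{
	for (int i = 0; i < CHUNK; i++)
	{
		char		c = p[i];

		if (c == quotec || c == escapec)
			return true;
		if (!in_quote && (c == '\n' || c == '\r'))
			return true;
	}
	return false;
}

/* Returns the offset of the first unquoted '\n', or len if there is none. */
static size_t
find_line_end(const char *buf, size_t len, char quotec, char escapec)
{
	bool		distinct_esc = (escapec != quotec);
	bool		in_quote = false;
	bool		last_was_esc = false;
	size_t		pos = 0;

	while (pos < len)
	{
		char		c;

		/* Chunked path only when no escape is pending. */
		if (!last_was_esc && pos + CHUNK <= len &&
			!chunk_needs_scalar(buf + pos, quotec, escapec, in_quote))
		{
			pos += CHUNK;		/* last_was_esc is already false here */
			continue;
		}

		/* Scalar step: the quote/escape toggles live in one place. */
		c = buf[pos];
		if (in_quote && distinct_esc && c == escapec)
			last_was_esc = !last_was_esc;
		if (c == quotec && !last_was_esc)
			in_quote = !in_quote;
		if (!distinct_esc || c != escapec)
			last_was_esc = false;
		if (!in_quote && c == '\n')
			return pos;
		pos++;
	}
	return len;
}
```

Because the chunked path is taken only when last_was_esc is false, a
skipped chunk never has to carry escape state across its boundary.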
