
Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
To:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2025-08-13 06:21:06
Message-ID:	CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> wrote:

> > + * However, SIMD optimization cannot be applied in the following cases:
> > + * - Inside quoted fields, where escape sequences and closing quotes
> > + * require sequential processing to handle correctly.
> >
> > I think you can continue SIMD inside quoted fields. Only important
> > thing is you need to set last_was_esc to false when SIMD skipped the
> > chunk.
>
> That's a clever point that last_was_esc should be reset to false when
> a SIMD chunk is skipped. You're right about that specific case.
>
> However, the core challenge is not what happens when we skip a chunk,
> but what happens when a chunk contains special characters like quotes
> or escapes. The main reason we avoid SIMD inside quoted fields is that
> the parsing logic becomes fundamentally sequential and
> context-dependent.
>
> To correctly parse a "" as a single literal quote, we must perform a
> lookahead to check the next character. This is an inherently
> sequential operation that doesn't map well to SIMD's parallel nature.
>
> Trying to handle this stateful logic with SIMD would lead to
> significant implementation complexity, especially with edge cases like
> an escape character falling on the last byte of a chunk.

Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.

I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.

Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled, because line terminators inside a quoted
field are part of the field data and do not end the line.
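
To illustrate the in_quote point, here is a minimal sketch of how the
set of bytes that forces the slow path could shrink inside a quoted
field. It is not taken from the patch: special_mask and CHUNK_SIZE are
hypothetical names, and the per-byte loop is only a portable scalar
stand-in for a real SIMD compare.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 16			/* assumed chunk width, one 128-bit vector */

/*
 * Hypothetical helper: bit i of the result is set when chunk[i] must be
 * examined by the sequential parser.  The caller must supply exactly
 * CHUNK_SIZE readable bytes.
 */
static uint32_t
special_mask(const char *chunk, char quotec, char escapec, bool in_quote)
{
	uint32_t	mask = 0;

	for (size_t i = 0; i < CHUNK_SIZE; i++)
	{
		char		c = chunk[i];
		bool		special = (c == quotec || c == escapec);

		/* Line terminators only matter outside a quoted field. */
		if (!in_quote && (c == '\n' || c == '\r'))
			special = true;

		if (special)
			mask |= (uint32_t) 1 << i;
	}
	return mask;
}
```

With in_quote set, a chunk that contains only embedded newlines yields
a zero mask, so it can be skipped as a whole.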

+ last_was_esc = false;

Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.

+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;
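
As a rough sketch of the rule above (not the patch itself), the
following shows last_was_esc being cleared whenever the pointer moves
past ordinary bytes. Here advance_past_ordinary, CHUNK_SIZE and the
local rightmost_one_pos32 are hypothetical stand-ins; the last one
mimics what pg_rightmost_one_pos32 is used for in the fragment above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 16			/* assumed chunk width */

/* Stand-in for pg_rightmost_one_pos32; word must be nonzero. */
static int
rightmost_one_pos32(uint32_t word)
{
	int			pos = 0;

	while ((word & 1) == 0)
	{
		word >>= 1;
		pos++;
	}
	return pos;
}

/*
 * mask has bit i set when byte i of the current chunk is special.
 * Moves *pos up to the first special byte, or past the whole chunk if
 * there is none.  Returns true when the entire chunk was skipped.
 */
static bool
advance_past_ordinary(uint32_t mask, size_t *pos, bool *last_was_esc)
{
	int			advance;

	if (mask == 0)
	{
		/* No special byte at all: skip the chunk. */
		*pos += CHUNK_SIZE;
		*last_was_esc = false;
		return true;
	}

	advance = rightmost_one_pos32(mask);
	if (advance > 0)
	{
		/* At least one ordinary byte was skipped, so no escape is pending. */
		*pos += advance;
		*last_was_esc = false;
	}
	/* advance == 0: the very next byte is special; keep the flag as is. */
	return false;
}
```

When the first byte of the chunk is itself special, nothing is skipped
and the flag is left untouched, so the existing toggle logic still sees
the pending escape.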

I've attached a new patch that includes these changes. Further
modifications are still in progress.

--
Best regards,
Shinya Kato
NTT OSS Center

Attachment	Content-Type	Size
v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch	application/octet-stream	3.4 KB

In response to

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2025-08-12 07:25:36 from Shinya Kato

Responses

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2025-08-14 02:24:50 from KAZAR Ayoub
