Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To:	Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2025-08-07 11:15:06
Message-ID:	CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com
Lists:	pgsql-hackers

Hi,

Thank you for working on this!

On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> text}) command and observed approximately a 5% performance
> improvement. Please see the detailed test results below.

I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.

Also, I ran a benchmark on the text format, with line lengths in the
table ranging from 1 byte to 1 megabyte. The improvement peaks at a
line length of 4096 bytes, where it is more than 20% [1]. I saw no
regression with your patch.

> Idea
> ====
> The current text/CSV parser processes input byte-by-byte, checking
> whether each byte is a special character (\n, \r, quote, escape) or a
> regular character, and transitions states in a state machine. This
> sequential processing is inefficient and likely causes frequent branch
> mispredictions due to the many if statements.
>
> I thought this problem could be addressed by leveraging SIMD and
> vectorized operations for faster processing.
>
> Implementation Overview
> =======================
> 1. Create a vector of special characters (e.g., Vector8 nl =
> vector8_broadcast('\n');).
> 2. Load the input buffer into a Vector8 variable called chunk.
> 3. Perform vectorized operations between chunk and the special
> character vectors to check if the buffer contains any special
> characters.
> 4-1. If no special characters are found, advance the input_buf_ptr by
> sizeof(Vector8).
> 4-2. If special characters are found, advance the input_buf_ptr as far
> as possible, then fall back to the original text/CSV parser for
> byte-by-byte processing.
>
...
> Thought?
> I would appreciate feedback on the implementation and any suggestions
> for further improvement.

I have a couple of ideas that I was working on:
---

+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.

I think you can continue using SIMD inside quoted fields. The only
important thing is that you need to set last_was_esc to false when
SIMD skips a chunk.
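
To make this concrete, here is a minimal sketch in plain C. It is not
the patch's code: CHUNK_SIZE, chunk_has_byte() and skip_plain_chunks()
are hypothetical names, and memchr() stands in for the Vector8
comparisons. It assumes that inside a quoted CSV field only the quote
and escape characters need byte-by-byte handling.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for sizeof(Vector8); the real size is platform dependent. */
#define CHUNK_SIZE 16

/* Hypothetical helper: true if any of the CHUNK_SIZE bytes equals c.
 * memchr() stands in for a vectorized comparison here. */
static bool
chunk_has_byte(const char *chunk, char c)
{
    return memchr(chunk, c, CHUNK_SIZE) != NULL;
}

/*
 * Hypothetical helper: while inside a quoted field, skip whole chunks
 * that contain neither the quote nor the escape character. Returns the
 * new position; the caller falls back to byte-by-byte parsing from there.
 */
static size_t
skip_plain_chunks(const char *buf, size_t pos, size_t len,
                  char quotec, char escapec, bool *last_was_esc)
{
    while (len - pos >= CHUNK_SIZE)
    {
        const char *chunk = buf + pos;

        if (chunk_has_byte(chunk, quotec) || chunk_has_byte(chunk, escapec))
            break;              /* special character found: stop skipping */

        pos += CHUNK_SIZE;

        /*
         * Every skipped byte was an ordinary character, so any pending
         * escape state from before the chunk no longer applies.
         */
        *last_was_esc = false;
    }
    return pos;
}
```

The reset is the important part: the byte-by-byte parser would have
cleared last_was_esc on the first ordinary character of the chunk, so
the skip path has to leave the same state behind.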
---

+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.

You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)',
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data'. So SIMD can only run again once at least
sizeof(Vector8) bytes are available in the buffer. In the worst case,
if CopyLoadInputBuf() loads one byte fewer than sizeof(Vector8), SIMD
can never run. I think we need to make sure that CopyLoadInputBuf()
loads at least sizeof(Vector8) bytes into input_buf so that we do not
encounter this problem.
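
As a rough illustration of the idea (not CopyLoadInputBuf() itself),
here is a self-contained sketch. InputBuf, ensure_chunk(),
TrickleSource and trickle_read() are all hypothetical names, and
CHUNK_SIZE is a stand-in for sizeof(Vector8). The source deliberately
returns fewer than CHUNK_SIZE bytes per call to mimic the worst case.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for sizeof(Vector8). */
#define CHUNK_SIZE 16

/* Hypothetical input buffer: bytes in [ptr, len) are still unread. */
typedef struct
{
    char   data[64];
    size_t ptr;
    size_t len;
} InputBuf;

/* Hypothetical reader callback: returns bytes copied, 0 at end of input. */
typedef size_t (*read_fn) (char *dst, size_t max, void *arg);

/* Hypothetical source that hands out at most max_per_call bytes per call. */
typedef struct
{
    const char *src;
    size_t      remaining;
    size_t      max_per_call;
} TrickleSource;

static size_t
trickle_read(char *dst, size_t max, void *arg)
{
    TrickleSource *s = arg;
    size_t      n = s->remaining;

    if (n > s->max_per_call)
        n = s->max_per_call;
    if (n > max)
        n = max;
    memcpy(dst, s->src, n);
    s->src += n;
    s->remaining -= n;
    return n;
}

/*
 * Top up the buffer until at least CHUNK_SIZE unread bytes are available
 * or the source is exhausted. Returns the number of unread bytes.
 */
static size_t
ensure_chunk(InputBuf *b, read_fn read_more, void *arg)
{
    size_t unread = b->len - b->ptr;

    if (unread >= CHUNK_SIZE)
        return unread;

    /* Move the unread tail to the front to make room for new data. */
    memmove(b->data, b->data + b->ptr, unread);
    b->ptr = 0;
    b->len = unread;

    while (b->len < CHUNK_SIZE)
    {
        size_t n = read_more(b->data + b->len, sizeof(b->data) - b->len, arg);

        if (n == 0)
            break;              /* end of input */
        b->len += n;
    }
    return b->len;
}
```

With a refill like this, the vectorized path stays reachable whenever
enough input remains, even if each individual read is short.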
---

What do you think about adding SIMD to the CopyReadAttributesText()
and CopyReadAttributesCSV() functions? When I add your SIMD approach
to CopyReadAttributesText(), the improvement on the input with
4096-byte lines [1] goes from 20% to 30%.
---

I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help. Please let me know if you
have any questions.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment	Content-Type	Size
Feedback.txt	text/plain	3.4 KB

In response to

Speed up COPY FROM text/CSV parsing using SIMD at 2025-08-07 01:48:30 from Shinya Kato
