Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To:	KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
Cc:	Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2025-12-31 13:04:15
Message-ID:	CAN55FZ3fWSk0h09Yfbb2eO4COfQDSL7Ofk7xF3q_Wg4ags3kPw@mail.gmail.com
Lists:	pgsql-hackers

Hi,

On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:
>
> Hello,
> Following the same path of optimizing COPY FROM using SIMD, I found that COPY TO can also benefit from this.
>
> I attached a small patch that uses SIMD to skip data and advance until the first special character is found, then falls back to scalar processing for that character and re-enters the SIMD path again...
> There are two ways to do this:
> 1) Essentially we do SIMD until we find a special character, then continue on the scalar path without re-entering SIMD again.
> - This gives speedups of 10% to 30%, depending on the weight of special characters in the attribute; we don't lose anything here since it advances with SIMD until it can't (using the previous scripts: 1/3, 2/3 special chars).
>
> 2) Do the SIMD path, then use the scalar path when we hit a special character, and keep re-entering the SIMD path each time.
> - This is equivalent to the COPY FROM story; we'll need to find the same heuristic to use for both COPY FROM/TO to reduce the regressions (same regressions: roughly 20% to 30% with 1/3, 2/3 special chars).
>
> Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>
> So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
> Attached is 0004 which is option 1 (SIMD without re-entering), 0005 is the second one.
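
For readers following along, the difference between the two options can be sketched roughly as below. This is not code from 0004 or 0005: the names (scan_specials, chunk_has_special, is_special, CHUNK), the set of special characters, and the 16-byte chunk size are all assumptions for illustration, and the chunk check is a plain loop standing in for a real SIMD compare.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK 16				/* assumed stand-in for one SIMD vector width */

/* Hypothetical set of characters that need scalar handling in this sketch. */
static bool
is_special(char c)
{
	return c == '\\' || c == '\n' || c == '\t';
}

/* Stand-in for a SIMD compare: does this CHUNK-byte block hold a special? */
static bool
chunk_has_special(const char *p)
{
	bool		found = false;

	for (int i = 0; i < CHUNK; i++)
		found |= is_special(p[i]);
	return found;
}

/*
 * Count special characters in s[0..len).  With reenter = false this mimics
 * option 1 (after the first chunk that cannot be skipped, stay scalar); with
 * reenter = true it mimics option 2 (retry the chunk path after each special
 * character).  *chunks_skipped reports how many whole chunks were skipped.
 */
static size_t
scan_specials(const char *s, size_t len, bool reenter, size_t *chunks_skipped)
{
	size_t		i = 0;
	size_t		specials = 0;
	bool		simd_ok = true;

	*chunks_skipped = 0;
	while (i < len)
	{
		if (simd_ok && i + CHUNK <= len)
		{
			if (!chunk_has_special(s + i))
			{
				i += CHUNK;		/* whole chunk is plain data, skip it */
				(*chunks_skipped)++;
				continue;
			}
			if (!reenter)
				simd_ok = false;	/* option 1: scalar from here on */
		}

		/* Scalar path: run to the end, or (option 2) past the next special. */
		while (i < len)
		{
			bool		special = is_special(s[i++]);

			if (special)
			{
				specials++;
				if (reenter)
					break;		/* option 2: try the chunk path again */
			}
		}
	}
	return specials;
}
```

On a 64-byte attribute with a single special character at offset 20, this sketch skips one chunk in the option 1 mode and three chunks in the option 2 mode; the extra skips in option 2 are paid for with a failed chunk check each time a special character is near.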

Patches look correct to me. I think we could move these SIMD code
portions into a shared function to remove duplication, although that
might have a performance impact. I have not benchmarked these patches
yet.

Another consideration is that these patches might need their own
thread, though I am not completely sure about this yet.

One question: what do you think about having a 0004-style approach for
COPY FROM? What I have in mind is running SIMD for each line & column,
stopping SIMD once it can no longer skip an entire chunk, and then
continuing with the next line & column.
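
A very rough sketch of that idea, to make the control flow concrete. Everything here is hypothetical and not taken from any patch: count_fields, chunk_is_plain, DELIM, and CHUNK_BYTES are made-up names, quoting is ignored, a backslash simply escapes the next byte, and the chunk check is a plain loop standing in for a SIMD compare.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_BYTES 16			/* assumed stand-in for one SIMD vector width */
#define DELIM ','				/* assumed field delimiter for this sketch */

/* Stand-in for a SIMD compare: true if no byte in the chunk needs attention. */
static bool
chunk_is_plain(const char *p)
{
	bool		special = false;

	for (int i = 0; i < CHUNK_BYTES; i++)
		special |= (p[i] == DELIM || p[i] == '\n' || p[i] == '\\');
	return !special;
}

/*
 * Count fields in buf[0..len), where every field ends with DELIM or '\n'.
 * For each field: skip whole plain chunks, stop the chunk path at the first
 * chunk that cannot be skipped, finish the field byte by byte, and only
 * then try the chunk path again for the next field.
 */
static size_t
count_fields(const char *buf, size_t len, size_t *chunks_skipped)
{
	size_t		i = 0;
	size_t		fields = 0;

	*chunks_skipped = 0;
	while (i < len)
	{
		/* Chunk phase: entered once per field. */
		while (i + CHUNK_BYTES <= len && chunk_is_plain(buf + i))
		{
			i += CHUNK_BYTES;
			(*chunks_skipped)++;
		}

		/* Scalar phase: advance to this field's terminator. */
		while (i < len && buf[i] != DELIM && buf[i] != '\n')
			i += (buf[i] == '\\' && i + 1 < len) ? 2 : 1;

		fields++;
		if (i < len)
			i++;				/* consume the delimiter or newline */
	}
	return fields;
}
```

With two 40-byte fields on one line, this skips two chunks in each field and falls back to scalar only for the tail of each field, so short fields never pay more than one failed chunk check.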

--
Regards,
Nazir Bilal Yavuz
Microsoft

In response to

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2025-12-24 15:07:55 from KAZAR Ayoub

Responses

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2026-01-06 20:05:05 from Manni Wood
Re: Speed up COPY FROM text/CSV parsing using SIMD at 2026-01-17 20:46:52 from KAZAR Ayoub
