Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2025-10-16 14:29:36
Message-ID:	CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com
Lists:	pgsql-hackers

Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> >> I am able to reproduce the regression you mentioned, but both
> >> regressions are 20% on my end. I found (by experimenting) that SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> > number of times.
> >
>
> I was thinking a bit about that this morning. I wonder if it might be
> better instead of having a constantly applied heuristic like this, it
> might be better to do a little extra accounting in the first, say, 1000
> lines of an input file, and if less than some portion of the input is
> found to be special characters then switch to the SIMD code. What that
> portion should be would need to be determined by some experimentation
> with a variety of typical workloads, but given your findings 20% seems
> like a good starting point.

I implemented a heuristic similar to this. It is a mix of the
previous heuristic and your idea, and it works like this:

The overall logic is that we do not run SIMD for the entire line, and
we decide whether it is worth running SIMD for the following lines.

1 - We try SIMD and decide whether it is worth running.
1.1 - If it is worth it, we continue to run SIMD and halve the
simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we double simd_last_sleep_cycle and do
not run SIMD for that many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we go back to step 1.
Note: simd_last_sleep_cycle cannot exceed 1024, so in the worst case
we still try SIMD once every 1024 lines.
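
The steps above can be sketched roughly as follows. This is only an
illustration, not the code from the attached patch: the struct, the
function names, the lines_to_skip counter and the SIMD_SLEEP_MAX macro
are hypothetical, and only simd_last_sleep_cycle and the 1024 cap come
from the description. The sketch also assumes that the variable starts
at 0 and that "doubling" 0 yields 1, since the description does not say
how the first back-off is seeded. How "worth it" is measured is left to
the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Upper bound for the back-off, taken from the description (1024 lines). */
#define SIMD_SLEEP_MAX 1024

/* Hypothetical per-COPY state; only simd_last_sleep_cycle is named above. */
typedef struct SimdHeuristic
{
	uint32_t	simd_last_sleep_cycle;	/* current back-off length, in lines */
	uint32_t	lines_to_skip;	/* lines left before SIMD is tried again */
} SimdHeuristic;

/*
 * Call once per input line. Returns true if SIMD should be tried on this
 * line (step 1), false if the line falls inside a skip window (step 1.3).
 */
bool
simd_should_try(SimdHeuristic *h)
{
	if (h->lines_to_skip > 0)
	{
		h->lines_to_skip--;
		return false;
	}
	return true;
}

/*
 * Call after a line on which SIMD was tried. "worth_it" is the caller's
 * verdict; the criterion itself is not part of this sketch.
 */
void
simd_report(SimdHeuristic *h, bool worth_it)
{
	if (worth_it)
	{
		/* Step 1.1: keep running SIMD and halve the back-off. */
		h->simd_last_sleep_cycle /= 2;
	}
	else
	{
		/* Step 1.2: double the back-off (assumption: 0 becomes 1). */
		if (h->simd_last_sleep_cycle == 0)
			h->simd_last_sleep_cycle = 1;
		else
			h->simd_last_sleep_cycle *= 2;

		if (h->simd_last_sleep_cycle > SIMD_SLEEP_MAX)
			h->simd_last_sleep_cycle = SIMD_SLEEP_MAX;

		/* Skip SIMD for this many lines before trying again. */
		h->lines_to_skip = h->simd_last_sleep_cycle;
	}

	assert(h->simd_last_sleep_cycle <= SIMD_SLEEP_MAX);
}
```

In this sketch the line reader would call simd_should_try() before
parsing each line, and simd_report() only for lines where SIMD was
actually tried.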

With this heuristic, the regression is limited to 2% in the worst case.

Patches are attached. The first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes; 0002 is
the actual heuristic patch.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment	Content-Type	Size
v3-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch	text/x-patch	3.4 KB
v3-0002-COPY-SIMD-per-line-heuristic.patch	text/x-patch	6.5 KB

In response to

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2025-08-21 15:47:30 from Andrew Dunstan

Responses

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2025-10-18 18:46:29 from KAZAR Ayoub
Re: Speed up COPY FROM text/CSV parsing using SIMD at 2025-10-18 20:01:29 from Nazir Bilal Yavuz
