From: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
Date: | 2025-10-16 14:29:36 |
Message-ID: | CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> >> I am able to reproduce the regression you mentioned but both
> >> regressions are %20 on my end. I found that (by experimenting) SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> > number of times.
> >
>
> I was thinking a bit about that this morning. I wonder if it might be
> better instead of having a constantly applied heuristic like this, it
> might be better to do a little extra accounting in the first, say, 1000
> lines of an input file, and if less than some portion of the input is
> found to be special characters then switch to the SIMD code. What that
> portion should be would need to be determined by some experimentation
> with a variety of typical workloads, but given your findings 20% seems
> like a good starting point.
I implemented a heuristic something similar to this. It is a mix of
previous heuristic and your idea, it works like that:
Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.
1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for these many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to the #1.
Note: simd_last_sleep_cycle can not pass 1024, so we will run SIMD for
each 1024 lines at max.
With this heuristic the regression is limited by %2 in the worst case.
Patches are attached, the first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch | text/x-patch | 3.4 KB |
v3-0002-COPY-SIMD-per-line-heuristic.patch | text/x-patch | 6.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Jacob Champion | 2025-10-16 14:43:18 | Re: Thoughts on a "global" client configuration? |
Previous Message | Tomas Vondra | 2025-10-16 14:27:47 | Re: failed NUMA pages inquiry status: Operation not permitted |