| From: | Bilal Yavuz <byavuz81(at)gmail(dot)com> |
|---|---|
| To: | Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> |
| Cc: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2025-12-09 13:40:19 |
| Message-ID: | CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> wrote:
> > Hello, all.
> >
> > Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
>
> Thank you for doing this!
>
> > I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
> >
> > The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
> >
> > This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
>
> My input-generation script is not ready to share yet, but the inputs
> follow this format: text_${n}.input, where n represents the number of
> normal characters before the delimiter. For example:
>
> n = 0 -> "\n\n\n\n\n..." (no normal characters)
> n = 1 -> "a\n..." (1 normal character before the delimiter)
> ...
> n = 5 -> "aaaaa\n..."
> … continuing up to n = 32.
>
> Each line has 4096 chars and there are a total of 100000 lines in each
> input file.
>
> I only benchmarked the text format. I compared the latest heuristic I
> shared [1] with the current method. The benchmarks show roughly a ~16%
> regression at the worst case (n = 2), with regressions up to n = 5.
> For the remaining values, performance was similar.
I tried to improve the v4 patchset. My changes are:
1 - I changed CopyReadLineText() to an inline function and sent the
use_simd variable as an argument to get help from inlining.
2 - A main for loop in the CopyReadLineText() function is called many
times, so I moved the use_simd check to the CopyReadLine() function.
3 - Instead of 'bytes_processed', I used 'chars_processed' because
cstate->bytes_processed is increased before we process them and this
can cause wrong results.
4 - Because of #2 and #3, instead of having
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed /
special_chars_encountered' to determine whether we want to use SIMD.
5 - cstate->special_chars_encountered is incremented wrongly for the
CSV case. It is not incremented for the quote and escape delimiters. I
moved all increments of cstate->special_chars_encountered to the
central place and tried to optimize it but it still causes a
regression as it creates one more branching.
With these changes, I am able to decrease the regression to %10 from
%16. Regression decreases to %7 if I modify #5 for the only text input
but I did not do that.
My changes are in the 0003.
--
Regards,
Nazir Bilal Yavuz
Microsoft
| Attachment | Content-Type | Size |
|---|---|---|
| v4.1-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch | text/x-patch | 3.5 KB |
| v4.1-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch | text/x-patch | 5.0 KB |
| v4.1-0003-Feedback-Changes.patch | text/x-patch | 7.9 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Anton Haglund | 2025-12-09 14:02:38 | Re: [PATCH] Update comment in nodeBitmapHeapscan.c |
| Previous Message | David Geier | 2025-12-09 13:38:51 | Re: Fix repetition in hash index documentation |