Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>
Cc: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2025-12-09 13:40:19
Message-ID: CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> wrote:
> > Hello, all.
> >
> > Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
>
> Thank you for doing this!
>
> > I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
> >
> > The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
> >
> > This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
>
> My input-generation script is not ready to share yet, but the inputs
> follow this format: text_${n}.input, where n represents the number of
> normal characters before the delimiter. For example:
>
> n = 0 -> "\n\n\n\n\n..." (no normal characters)
> n = 1 -> "a\n..." (1 normal character before the delimiter)
> ...
> n = 5 -> "aaaaa\n..."
> … continuing up to n = 32.
>
> Each line has 4096 chars and there are a total of 100000 lines in each
> input file.
>
> I only benchmarked the text format. I compared the latest heuristic I
> shared [1] with the current method. The benchmarks show roughly a ~16%
> regression at the worst case (n = 2), with regressions up to n = 5.
> For the remaining values, performance was similar.

I tried to improve the v4 patchset. My changes are:

1 - I changed CopyReadLineText() to an inline function and sent the
use_simd variable as an argument to get help from inlining.

2 - A main for loop in the CopyReadLineText() function is called many
times, so I moved the use_simd check to the CopyReadLine() function.

3 - Instead of 'bytes_processed', I used 'chars_processed' because
cstate->bytes_processed is increased before we process them and this
can cause wrong results.

4 - Because of #2 and #3, instead of having
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed /
special_chars_encountered' to determine whether we want to use SIMD.

5 - cstate->special_chars_encountered is incremented wrongly for the
CSV case. It is not incremented for the quote and escape delimiters. I
moved all increments of cstate->special_chars_encountered to the
central place and tried to optimize it but it still causes a
regression as it creates one more branching.

With these changes, I am able to decrease the regression to %10 from
%16. Regression decreases to %7 if I modify #5 for the only text input
but I did not do that.

My changes are in the 0003.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v4.1-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch text/x-patch 3.5 KB
v4.1-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch text/x-patch 5.0 KB
v4.1-0003-Feedback-Changes.patch text/x-patch 7.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Anton Haglund 2025-12-09 14:02:38 Re: [PATCH] Update comment in nodeBitmapHeapscan.c
Previous Message David Geier 2025-12-09 13:38:51 Re: Fix repetition in hash index documentation