| From: | Bilal Yavuz <byavuz81(at)gmail(dot)com> |
|---|---|
| To: | Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> |
| Cc: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2025-12-06 07:55:50 |
| Message-ID: | CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> wrote:
> Hello, all.
>
> Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
Thank you for doing this!
> I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
>
> The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
>
> This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:
n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.
Each line has 4096 chars and there are a total of 100000 lines in each
input file.
I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show roughly a ~16%
regression at the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.
Actual comparison of timings (in ms):
current method / heuristic
n = 0 -> 3252.7253 / 2856.2753 (%12)
n = 1 -> 2910.321 / 2520.7717 (%13)
n = 2 -> 2865.008 / 2403.2017 (%16)
n = 3 -> 2608.649 / 2353.1477 (%9)
n = 4 -> 2460.74 / 2300.1783 (%6)
n = 5 -> 2451.696 / 2362.1573 (%3)
No difference for the rest.
Side note: Sorry for the delay in responding, I will continue working
on this next week.
[1] https://postgr.es/m/CAN55FZ1KF7XNpm2XyG%3DM-sFUODai%3D6Z8a11xE3s4YRBeBKY3tA%40mail.gmail.com
--
Regards,
Nazir Bilal Yavuz
Microsoft
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Victor Yegorov | 2025-12-06 08:07:19 | Re: Moving _bt_readpage and _bt_checkkeys into a new .c file |
| Previous Message | Bryan Green | 2025-12-06 07:08:47 | Re: [PATCH] Allow complex data for GUC extra. |