| From: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
| Subject: | Re: Speed up COPY TO text/CSV parsing using SIMD |
| Date: | 2026-03-27 18:48:38 |
| Message-ID: | CA+K2Runq+1gy8p6a-DsxpT2OkEkEu3cUGsZ9tdiGNrg_=P39gg@mail.gmail.com |
| Lists: | pgsql-hackers |
Hello,
On Thu, Mar 26, 2026 at 10:23 PM Nathan Bossart <nathandbossart(at)gmail(dot)com>
wrote:
> On Wed, Mar 18, 2026 at 03:29:32AM +0100, KAZAR Ayoub wrote:
> > If we have some json(b) column like : {"key1":"val1","key2":"val2"}, for
> > CSV format this would immediately exit the SIMD path because of quote
> > character, for json(b) this is going to be always the case.
> > I measured the overhead of exiting the SIMD path a lot (8 million times for
> > one COPY TO command), i only found 3% regression for this case, sometimes
> > 2%.
>
> I'm a little worried that we might be dismissing small-yet-measurable
> regressions for extremely common workloads. Unlike the COPY FROM work,
> this operates on a per-attribute level, meaning we only use SIMD when an
> attribute is at least 16 bytes. The extra branching for each attribute
> might not be something we can just ignore.
>
Thanks for the review.
I added a prescan loop inside the SIMD helpers that looks for special
characters in the first sizeof(Vector8) characters. I measured how well this
reduces the overhead of starting SIMD and exiting at the first vector:
the scalar loop beats SIMD for one vector if it finds a special
character before the 6th character; the worst case is not a clean vector,
where the scalar loop needs 20 more cycles than SIMD.
This helps mitigate the JSON(B)-in-CSV case, which is why I added it
for CSV only.
In a benchmark with 10M early SIMD exits, like the JSONB case, the previous
3% regression is gone.
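To illustrate the idea, here is a minimal standalone sketch of such a scalar prescan. It is not the code from the attached patch: the names csv_is_special, csv_prescan and PRESCAN_LEN are hypothetical, PRESCAN_LEN assumes a 16-byte Vector8, and the set of bytes treated as special is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: stand-in for sizeof(Vector8) with 128-bit vectors. */
#define PRESCAN_LEN 16

/* Hypothetical helper: bytes assumed to need special handling in CSV output. */
static inline bool
csv_is_special(char c, char quotec, char escapec, char delimc)
{
	return c == quotec || c == escapec || c == delimc ||
		c == '\n' || c == '\r';
}

/*
 * Scan at most PRESCAN_LEN bytes of s with a plain scalar loop.
 *
 * Returns the offset of the first special byte.  If none is found, returns
 * the number of bytes scanned (Min(len, PRESCAN_LEN)), which tells the
 * caller that the prefix is clean and the SIMD loop can take over.
 */
static size_t
csv_prescan(const char *s, size_t len,
			char quotec, char escapec, char delimc)
{
	size_t		limit = (len < PRESCAN_LEN) ? len : PRESCAN_LEN;
	size_t		i;

	for (i = 0; i < limit; i++)
	{
		if (csv_is_special(s[i], quotec, escapec, delimc))
			break;
	}
	return i;
}
```

For a value like {"key1":"val1","key2":"val2"} the loop stops at offset 1, so the vector setup is never paid for; for a clean prefix it returns 16 and the caller would continue with the vectorized loop.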
For the normal benchmarks (clean, 1/3 specials, wide table), I ran v4 for
longer this time and found this:
| Test | Master | V4 (change vs. master) |
|---|---|---|
| TEXT clean | 1619ms | -28.0% |
| CSV clean | 1866ms | -37.1% |
| TEXT 1/3 backslashes | 3913ms | +1.2% |
| CSV 1/3 quotes | 4012ms | -3.0% |
Wide table TEXT:
| Cols | Master | V4 (change vs. master) |
|---|---|---|
| 50 | 2109ms | -2.9% |
| 100 | 2029ms | -1.6% |
| 200 | 3982ms | -2.9% |
| 500 | 1962ms | -6.1% |
| 1000 | 3812ms | -3.6% |
Wide table CSV:
| Cols | Master | V4 (change vs. master) |
|---|---|---|
| 50 | 2531ms | +0.3% |
| 100 | 2465ms | +1.1% |
| 200 | 4965ms | -0.2% |
| 500 | 2346ms | +1.4% |
| 1000 | 4709ms | -0.4% |
Do we need more benchmarks for other kinds of workloads? Or am I
missing something else that has noticeable overhead?
Regards,
Ayoub
| Attachment | Content-Type | Size |
|---|---|---|
| v4-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch | text/x-patch | 13.3 KB |