| From: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
| Subject: | Re: Speed up COPY TO text/CSV parsing using SIMD |
| Date: | 2026-03-14 22:43:38 |
| Message-ID: | CA+K2Rum7+Jm2rm65K5msxaiAM8QTkhSNAYarPBP9O7nBXYo12Q@mail.gmail.com |
| Lists: | pgsql-hackers |
Hello,
On Tue, Mar 10, 2026 at 8:17 PM Nathan Bossart <nathandbossart(at)gmail(dot)com>
wrote:
> On Sat, Feb 14, 2026 at 04:02:21PM +0100, KAZAR Ayoub wrote:
> > On Thu, Feb 12, 2026 at 10:25 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> I have a hard time believing that adding a strlen() to the handling of a
> >> short column won't be a measurable overhead with lots of short attributes.
> >> Particularly because the patch afaict will call it repeatedly if there are
> >> any to-be-escaped characters.
> >
> > [...]
> >
> > 1000 columns:
> > TEXT: 17% regression
> > CSV: 3.4% regression
> >
> > 500 columns:
> > TEXT: 17.7% regression
> > CSV: 3.1% regression
> >
> > 100 columns:
> > TEXT: 17.3% regression
> > CSV: 3% regression
> >
> > A bit unstable results, but yeah the overhead for worse cases like this is
> > really significant, I can't argue whether this is worth it or not, so
> > thoughts on this ?
>
> I seriously doubt we'd commit something that produces a 17% regression
> here. Perhaps we should skip the SIMD paths whenever transcoding is
> required.
>
> --
> nathan
>
I've spent some time rethinking this, and here's what I've done in v3:
SIMD is only used for varlena attributes whose text representation is
longer than a single SIMD vector, and only when no transcoding is required.
Fixed-size types such as integers mostly produce short ASCII output,
for which SIMD provides no benefit.
For eligible attributes, the stored varlena size is used as a cheap
pre-filter to avoid an unnecessary strlen() call on short values.
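As a rough illustration of the eligibility check described above (not the patch itself), the decision can be sketched in standalone C. The names simd_eligible and VECTOR_WIDTH and all parameters are hypothetical; the value 16 assumes a 128-bit vector, and the real patch works on Datums and tuple descriptors rather than plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 128-bit vectors; hypothetical stand-in for sizeof(Vector8). */
#define VECTOR_WIDTH 16

/*
 * Hypothetical helper: decide whether an attribute should take the SIMD path.
 *
 * attlen follows the pg_attribute convention: > 0 for fixed-size types,
 * -1 for varlena, -2 for C strings.
 * payload_size is the stored varlena size excluding its header.
 */
bool
simd_eligible(int attlen, size_t payload_size, bool need_transcoding)
{
    assert(attlen > 0 || attlen == -1 || attlen == -2);

    if (need_transcoding)
        return false;   /* transcoding required: stay on the scalar path */
    if (attlen != -1)
        return false;   /* not a varlena: output is usually short ASCII */

    /* Cheap pre-filter: compares a size already at hand, no strlen() call. */
    return payload_size > VECTOR_WIDTH;
}
```

Under these assumptions, a 64-byte text value with no transcoding qualifies, while an int4 column or any value of 16 bytes or less stays on the scalar path.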
Here are the benchmark results after many runs compared to master
(4deecb52aff):
TEXT clean: -34.0%
CSV clean: -39.3%
TEXT 1/3: +4.7%
CSV 1/3: -2.3%
The numbers above vary by 1% to 3% in either direction (improvement or
regression) across 20+ runs.
Wide tables, short attributes, TEXT:
50 columns: -3.7%
100 columns: -1.7%
200 columns: +1.8%
500 columns: -0.5%
1000 columns: -0.3%
Wide tables, short attributes, CSV:
50 columns: -2.5%
100 columns: +1.8%
200 columns: +1.4%
500 columns: -0.9%
1000 columns: -1.1%
The wide-table benchmarks were all similarly noisy: across 20+ runs the
results always fall between about -2% and +4% for all numbers of columns.
One small concern: some varlenas have a larger binary size than their text
representation, for example:
SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
pg_column_size
----------------
32
Its text representation is shorter than sizeof(Vector8), so v3 currently
enters the SIMD path and exits right at the beginning (two extra branches),
because it does this:
+ if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
+ VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))
I thought maybe we could apply a factor of 2 or 4 to the binary size check.
The right factor really depends on the type, but this is just a suggestion
in case this is something concerning.
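One possible reading of this suggestion is to scale the threshold so that a value must be larger than a multiple of the vector width before the SIMD path is tried. The sketch below is only that reading, in standalone C; simd_eligible_scaled, VECTOR_WIDTH (16, assuming a 128-bit vector), and the factor parameter are all hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 128-bit vectors; hypothetical stand-in for sizeof(Vector8). */
#define VECTOR_WIDTH 16

/*
 * Hypothetical variant of the size pre-filter: require the binary size to
 * exceed factor * VECTOR_WIDTH. A factor of 1 matches the plain comparison
 * against the vector width; 2 or 4 makes the filter more conservative for
 * types whose text form is shorter than their binary form.
 */
bool
simd_eligible_scaled(size_t binary_size, size_t factor)
{
    assert(factor >= 1);
    return binary_size > VECTOR_WIDTH * factor;
}
```

Taking 32 bytes as the binary size purely for illustration (pg_column_size may include the varlena header, which VARSIZE_ANY_EXHDR excludes), a factor of 1 lets the value into the SIMD path, while a factor of 2 or 4 keeps it on the scalar path. The trade-off is that larger factors also exclude values whose text form would have been long enough to benefit.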
Thoughts?
Regards,
Ayoub
| Attachment | Content-Type | Size |
|---|---|---|
| v3-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch | text/x-patch | 12.6 KB |