Re: Speed up COPY TO text/CSV parsing using SIMD

From: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: Speed up COPY TO text/CSV parsing using SIMD
Date: 2026-03-14 22:43:38
Message-ID: CA+K2Rum7+Jm2rm65K5msxaiAM8QTkhSNAYarPBP9O7nBXYo12Q@mail.gmail.com
Lists: pgsql-hackers

Hello,
On Tue, Mar 10, 2026 at 8:17 PM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:

> On Sat, Feb 14, 2026 at 04:02:21PM +0100, KAZAR Ayoub wrote:
> > On Thu, Feb 12, 2026 at 10:25 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> I have a hard time believing that adding a strlen() to the handling of a
> >> short column won't be a measurable overhead with lots of short attributes.
> >> Particularly because the patch afaict will call it repeatedly if there are
> >> any to-be-escaped characters.
> >
> > [...]
> >
> > 1000 columns:
> > TEXT: 17% regression
> > CSV: 3.4% regression
> >
> > 500 columns:
> > TEXT: 17.7% regression
> > CSV: 3.1% regression
> >
> > 100 columns:
> > TEXT: 17.3% regression
> > CSV: 3% regression
> >
> > A bit unstable results, but yeah the overhead for worse cases like this is
> > really significant, I can't argue whether this is worth it or not, so
> > thoughts on this ?
>
> I seriously doubt we'd commit something that produces a 17% regression
> here. Perhaps we should skip the SIMD paths whenever transcoding is
> required.
>
> --
> nathan
>
I've spent some time rethinking this, and here's what I've done in v3:
SIMD is only used for varlena attributes whose text representation is
longer than a single SIMD vector, and only when no transcoding is required.

Fixed-size types such as integers mostly produce short ASCII output,
for which SIMD provides no benefit.

For eligible attributes, the stored varlena size is used as a cheap
pre-filter to avoid an unnecessary strlen() call on short values.
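
To make the eligibility rule above concrete, here is a minimal standalone sketch. It is not the patch's code: `SketchAttr`, `sketch_use_simd` and `SKETCH_VECTOR_BYTES` are hypothetical names, and the 16-byte vector width is an assumption standing in for sizeof(Vector8).

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: stand-in for sizeof(Vector8), taken to be 16 bytes here. */
#define SKETCH_VECTOR_BYTES 16

/* Hypothetical, simplified view of what the check needs to know. */
typedef struct SketchAttr
{
    int     attlen;         /* -1 marks a varlena type */
    size_t  stored_payload; /* stored payload bytes, header excluded */
} SketchAttr;

/*
 * Decide whether an attribute should take the SIMD path:
 * no transcoding, varlena type, and stored payload larger than one vector.
 * The stored size is only a cheap pre-filter; it is not the text length.
 */
static bool
sketch_use_simd(const SketchAttr *attr, bool need_transcoding)
{
    if (need_transcoding)
        return false;       /* transcoding required: skip SIMD entirely */
    if (attr->attlen != -1)
        return false;       /* fixed-size type: output is usually short */
    return attr->stored_payload > SKETCH_VECTOR_BYTES;
}
```

With this shape, a short value is rejected using only information that is already at hand, so no strlen() is needed to decide.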

Here are the benchmark results after many runs compared to master
(4deecb52aff):
TEXT clean: -34.0%
CSV clean: -39.3%
TEXT 1/3: +4.7%
CSV 1/3: -2.3%
The above numbers vary by 1% to 3% (improvements or regressions)
across 20+ runs.

WIDE tables short attributes TEXT:
50 columns: -3.7%
100 columns: -1.7%
200 columns: +1.8%
500 columns: -0.5%
1000 columns: -0.3%

WIDE tables short attributes CSV:
50 columns: -2.5%
100 columns: +1.8%
200 columns: +1.4%
500 columns: -0.9%
1000 columns: -1.1%

The wide-table benchmarks were all similar noise: across 20+ runs, the results
are always between about -2% and +4% for all numbers of columns.

Just a small concern about cases where a varlena has a larger binary size
than its text representation, e.g.:
SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
 pg_column_size
----------------
             32

Its text representation is shorter than sizeof(Vector8), so v3 currently
enters the SIMD path and exits right at the start (two extra branches),
because it does this:
+ if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
+ VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))

I thought maybe we could apply a * 2 or * 4 factor to the binary size check; it
really depends on the type, but this is just a suggestion in case this is a concern.
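
A rough sketch of how such a factor could look, under one reading of the suggestion: the stored size must exceed the vector width multiplied by the factor. The names `sketch_prefilter`, `sketch_text_long_enough` and `SKETCH_VECTOR_BYTES` are hypothetical, the 16-byte width is an assumption, and none of this is taken from the v3 patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for sizeof(Vector8), taken to be 16 bytes here. */
#define SKETCH_VECTOR_BYTES 16

/*
 * Stage 1: cheap pre-filter on the stored (binary) payload size.
 * factor = 1 mirrors the plain "size > vector width" comparison;
 * factor = 2 or 4 is the hypothetical stricter variant.
 */
static bool
sketch_prefilter(size_t stored_payload, size_t factor)
{
    return stored_payload > SKETCH_VECTOR_BYTES * factor;
}

/*
 * Stage 2: check on the actual text output. This is the check that makes
 * the SIMD path exit early when the text is shorter than one vector.
 */
static bool
sketch_text_long_enough(const char *text)
{
    return strlen(text) > SKETCH_VECTOR_BYTES;
}
```

For a value whose binary form is larger than its text form, say a hypothetical 28 stored bytes producing 12 characters of text, factor 1 passes stage 1 and then fails stage 2, while factor 2 rejects it at stage 1. The trade-off is that a larger factor also excludes values whose text would have been long enough to benefit.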

Thoughts?

Regards,
Ayoub

Attachment Content-Type Size
v3-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch text/x-patch 12.6 KB
