Re: Speed up COPY TO text/CSV parsing using SIMD

From: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: Speed up COPY TO text/CSV parsing using SIMD
Date: 2026-02-14 15:02:21
Message-ID: CA+K2Rum_QTZqTUrdMOL5hr-OOpCwGR_9Nj1z15BFObjktMOY6A@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Feb 12, 2026 at 10:25 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2026-02-12 22:07:52 +0100, KAZAR Ayoub wrote:
> > Currently optimizing COPY FROM using SIMD is still under review, but for
> > the case of COPY TO using the same ideas, we found that the problem is
> > trivial, the attached patch gives very nice speedups as confirmed by
> > Manni's benchmarks.
>
> I have a hard time believing that adding a strlen() to the handling of a
> short column won't be a measurable overhead with lots of short attributes.
> Particularly because the patch afaict will call it repeatedly if there are
> any to-be-escaped characters.
>
Thanks for pointing that out. Here's what I did:
1) In the previous patch, strlen() was called twice if a CSV attribute needed
quoting. The attached patch computes the length once at the beginning and
uses it for both SIMD paths, so there is basically one call.
2) If an attribute needs encoding, we have to recalculate the string length
because the string can grow (so at most 2 calls in all cases).
3) To cover the worst case, I benchmarked this against master on tables with
100, 500, and 1000 columns, all integers. For such short attributes one
would rather process the whole value in a single pass than compute its
length first. Results relative to master:
1000 columns:
TEXT: 17% regression
CSV: 3.4% regression

500 columns:
TEXT: 17.7% regression
CSV: 3.1% regression

100 columns:
TEXT: 17.3% regression
CSV: 3% regression

The results are a bit unstable, but the overhead in worst cases like this is
really significant. I can't say whether this is worth it or not, so any
thoughts on this?

> I also don't think it's good how much code this repeats. I think you'd have
> to start with preparatory moving the existing code into static inline helper
> functions and then introduce SIMD into those.
>
Done, though I'm not sure whether this is the right place to put it. Let me
know.

Regards,
Ayoub

Attachment: v2-0001-Speed-up-COPY-TO-text-CSV-using-SIMD.patch (text/x-patch, 7.0 KB)
