Re: Speed up COPY FROM text/CSV parsing using SIMD

From: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-02-06 22:36:13
Message-ID: CA+K2RunXPYZ+xz8OSkUa6LVjdbLYX=mEvkGR6mmqHXEQgMd1DA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

On Fri, Feb 6, 2026 at 11:19 PM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
wrote:

> Hi,
>
> Thank you for sharing your thoughts!
>
> On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <nathandbossart(at)gmail(dot)com>
> wrote:
> >
> > It looks like a lot of energy has been put into benchmarking and refining
> > the heuristic for deciding when to use the SIMD path so that we avoid
> large
> > regressions when there are special characters. I think this is all
> > valuable work, but I'm a bit concerned that we are putting the cart
> before
> > the horse. IMHO it would be better to first get the SIMD code committed
> > with the absolute simplest heuristic we can think of (e.g., as soon as we
> > see a special character, switch to the scalar path for the remainder of
> > COPY FROM). My hope is that would be far easier to reason about from a
> > performance angle. If we immediately fall back to the existing code
> path,
> > we don't need to worry about how many special characters there are and
> > whether they are sparse or clustered or whatever. We just need to
> measure
> > the overhead of the new branches and ensure they don't produce meaningful
> > regressions. Assuming that all looks good, we can then focus on the SIMD
> > code itself and make sure that is correct and optimal. And once we get
> > that portion committed, we could then consider more sophisticated
> > heuristics.
>
I also agree on this, especially for the line_buf refilling idea, it needs
a bit more time to find the good value of threshold than work for
heuristic.

>
> I have three possible approaches in my mind, they are actually similar
> to each other.
>
> 1- After encountering a special character, disable SIMD for the rest
> of the current line and also for the rest of the data.
>
> 2- It is a mixed version of the current heuristic and #1. After
> encountering a special character, skip SIMD for the current line (let'
> say line 1) and for the next line (line 2). Then try running SIMD for
> the next line (line 3), if there is no special character continue to
> run SIMD but if there is a special character then skip running SIMD
> for two lines this time. And it goes like that, everytime special
> character is encountered in the SIMD run, skipped SIMD lines are
> doubled.
>
> 3- This version is a bit different from #2. Instead of calculating the
> number of lines to skip dynamically, skip the constant N number of
> lines and then try to run SIMD again after these lines. N could be
> something like 100, 1000, or 10000 etc.. Actually, you and Andrew
> suggested this approach before [1].
>
> I think what you suggested is closer to #1 or #3. I just wanted to
> hear your opinions, and whether you think any of these approaches are
> good to implement / work on.
>
For v19, #1 seems like a "wasted potential", #3 sounds more relaxed than
v4.2 so this has good potential, i can fully benchmark it against v3 as
soon as you send a patch for it.

Regards,
Ayoub

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2026-02-06 22:39:30 Re: libpq: Bump protocol version to version 3.2 at least until the first/second beta
Previous Message Nathan Bossart 2026-02-06 22:27:22 Re: [PATCH] pg_bsd_indent: improve formatting of multiline comments