| From: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz> |
|---|---|
| To: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
| Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2026-02-06 22:36:13 |
| Message-ID: | CA+K2RunXPYZ+xz8OSkUa6LVjdbLYX=mEvkGR6mmqHXEQgMd1DA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hello,
On Fri, Feb 6, 2026 at 11:19 PM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
wrote:
> Hi,
>
> Thank you for sharing your thoughts!
>
> On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <nathandbossart(at)gmail(dot)com>
> wrote:
> >
> > It looks like a lot of energy has been put into benchmarking and refining
> > the heuristic for deciding when to use the SIMD path so that we avoid large
> > regressions when there are special characters. I think this is all
> > valuable work, but I'm a bit concerned that we are putting the cart before
> > the horse. IMHO it would be better to first get the SIMD code committed
> > with the absolute simplest heuristic we can think of (e.g., as soon as we
> > see a special character, switch to the scalar path for the remainder of
> > COPY FROM). My hope is that would be far easier to reason about from a
> > performance angle. If we immediately fall back to the existing code path,
> > we don't need to worry about how many special characters there are and
> > whether they are sparse or clustered or whatever. We just need to measure
> > the overhead of the new branches and ensure they don't produce meaningful
> > regressions. Assuming that all looks good, we can then focus on the SIMD
> > code itself and make sure that is correct and optimal. And once we get
> > that portion committed, we could then consider more sophisticated
> > heuristics.
>
I also agree with this, especially regarding the line_buf refilling idea:
finding a good threshold value for it needs a bit more time than the work
on the heuristic.
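
The simplest heuristic quoted above (switch to the scalar path for the remainder of COPY FROM once a special character is seen) amounts to a one-way flag. A minimal sketch, assuming hypothetical names (`SimdGate`, `simd_gate_init`, `simd_gate_use_simd`, `simd_gate_note_special`) that are not taken from any posted patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-COPY state; not a field of any real PostgreSQL struct. */
typedef struct SimdGate
{
	bool		simd_enabled;	/* cleared once, never set again */
} SimdGate;

/* Start every COPY FROM with the SIMD path enabled. */
static void
simd_gate_init(SimdGate *gate)
{
	assert(gate != NULL);
	gate->simd_enabled = true;
}

/* Checked before parsing: may we still try the SIMD path? */
static bool
simd_gate_use_simd(const SimdGate *gate)
{
	return gate->simd_enabled;
}

/* Called when a special character is seen: fall back for good. */
static void
simd_gate_note_special(SimdGate *gate)
{
	gate->simd_enabled = false;
}
```

With this shape, the only cost to measure on data full of special characters is the extra branch on `simd_enabled`, since the SIMD code is never reached again after the first hit.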
>
> I have three possible approaches in my mind, they are actually similar
> to each other.
>
> 1- After encountering a special character, disable SIMD for the rest
> of the current line and also for the rest of the data.
>
> 2- It is a mixed version of the current heuristic and #1. After
> encountering a special character, skip SIMD for the current line (let's
> say line 1) and for the next line (line 2). Then try running SIMD for
> the next line (line 3); if there is no special character, continue to
> run SIMD, but if there is a special character, then skip running SIMD
> for two lines this time. And it goes on like that: every time a special
> character is encountered in the SIMD run, the number of skipped lines is
> doubled.
>
> 3- This version is a bit different from #2. Instead of calculating the
> number of lines to skip dynamically, skip a constant number of lines, N,
> and then try to run SIMD again after these lines. N could be
> something like 100, 1000, or 10000, etc. Actually, you and Andrew
> suggested this approach before [1].
>
> I think what you suggested is closer to #1 or #3. I just wanted to
> hear your opinions, and whether you think any of these approaches are
> good to implement / work on.
>
For v19, #1 seems like wasted potential. #3 sounds more relaxed than
v4.2, so it has good potential; I can fully benchmark it against v3 as
soon as you send a patch for it.
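
For reference, the difference between #2 and #3 can be sketched roughly as follows. This is only an illustration: `SimdBackoff`, `BackoffKind`, and the three functions are hypothetical names, not taken from any posted patch. The description of #2 does not say whether the doubled skip count ever resets after a clean SIMD line, so this sketch assumes it never does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum BackoffKind
{
	BACKOFF_DOUBLING,			/* approach #2: skip 1, 2, 4, ... lines */
	BACKOFF_CONSTANT			/* approach #3: always skip N lines */
} BackoffKind;

/* Hypothetical state; not part of any real PostgreSQL struct. */
typedef struct SimdBackoff
{
	BackoffKind kind;
	uint64_t	skip_len;		/* lines to skip after the next special char */
	uint64_t	lines_to_skip;	/* full lines still to parse without SIMD */
} SimdBackoff;

/* n is only used for BACKOFF_CONSTANT; doubling starts at one line. */
static void
simd_backoff_init(SimdBackoff *b, BackoffKind kind, uint64_t n)
{
	assert(b != NULL);
	b->kind = kind;
	b->skip_len = (kind == BACKOFF_CONSTANT) ? n : 1;
	b->lines_to_skip = 0;
}

/* Call at the start of each line: should this line be tried with SIMD? */
static bool
simd_backoff_begin_line(SimdBackoff *b)
{
	if (b->lines_to_skip > 0)
	{
		b->lines_to_skip--;
		return false;
	}
	return true;
}

/*
 * Call when a SIMD run hits a special character.  The caller finishes the
 * current line with the scalar path; the following skip_len lines are
 * skipped as well.  Assumption: the doubling in #2 never resets.
 */
static void
simd_backoff_note_special(SimdBackoff *b)
{
	b->lines_to_skip = b->skip_len;
	if (b->kind == BACKOFF_DOUBLING && b->skip_len <= UINT64_MAX / 2)
		b->skip_len *= 2;
}
```

Under #2, a file where every line has a special character quickly ends up almost entirely on the scalar path, while under #3 it retries SIMD once every N + 1 lines, so the choice of N directly bounds the retry overhead.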
Regards,
Ayoub