| From: | Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> |
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
| Cc: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2026-02-14 03:34:13 |
| Message-ID: | CAKWEB6p-Y54yWA5kq6OXEYV=ABdHenJ559i0MshOoYkP4i=o5A@mail.gmail.com |
| Lists: | pgsql-hackers |
Hello!
I ran some COPY FROM tests using master and then Nazir's v7-0001 and
v7-0002 patches applied to master.
x86 master
TXT : 29222.524250 ms
CSV : 36162.588500 ms
TXT with 1/3 escapes: 32922.649750 ms
CSV with 1/3 quotes: 47631.423750 ms
x86 v7-0001
TXT : 23247.834250 ms 20.445496% improvement
CSV : 23162.711750 ms 35.948413% improvement
TXT with 1/3 escapes: 31786.386000 ms 3.451313% improvement
CSV with 1/3 quotes: 43330.475500 ms 9.029645% improvement
x86 v7-0002
TXT : 22394.812500 ms 23.364552% improvement
CSV : 22374.645750 ms 38.127643% improvement
TXT with 1/3 escapes: 32378.929750 ms 1.651507% improvement
CSV with 1/3 quotes: 47139.171750 ms 1.033461% improvement
arm master
TXT : 9448.900500 ms
CSV : 11135.871500 ms
TXT with 1/3 escapes: 10786.418750 ms
CSV with 1/3 quotes: 14115.335500 ms
arm v7-0001
TXT : 7271.170500 ms 23.047443% improvement
CSV : 7259.866750 ms 34.806479% improvement
TXT with 1/3 escapes: 10894.445500 ms 1.001507% regression
CSV with 1/3 quotes: 13398.444000 ms 5.078813% improvement
arm v7-0002
TXT : 7165.707250 ms 24.163587% improvement
CSV : 7140.497250 ms 35.878416% improvement
TXT with 1/3 escapes: 10308.782250 ms 4.428129% improvement
CSV with 1/3 quotes: 12576.179500 ms 10.904140% improvement
v7-0001 + v7-0002 applied to master certainly seems promising: nice to see
speed improvements across the board on both x86 and arm!
On Fri, Feb 13, 2026 at 5:09 PM Nathan Bossart <nathandbossart(at)gmail(dot)com>
wrote:
> On Fri, Feb 13, 2026 at 02:45:30PM +0300, Nazir Bilal Yavuz wrote:
> > Also, if I change this code to:
> >
> > if (cstate->simd_enabled)
> > {
> > if (is_csv)
> > result = CopyReadLineText(cstate, true, true);
> > else
> > result = CopyReadLineText(cstate, false, true);
> > }
> > else
> > {
> > if (is_csv)
> > result = CopyReadLineText(cstate, true, false);
> > else
> > result = CopyReadLineText(cstate, false, false);
> > }
> >
> > then I see ~5% performance improvement in the scalar path compared to master.
>
> Hm. What difference do you see if you just do
>
> if (is_csv)
> result = CopyReadLineText(cstate, true);
> else
> result = CopyReadLineText(cstate, false);
>
> both with and without the SIMD stuff? IIUC this is allowing the compiler
> to remove several branches in CopyReadLineText(), which might be a nice
> improvement on its own. That being said, I'm less convinced that adding a
> simd_enabled parameter to CopyReadLineText() helps, because 1) it's
> involved in fewer branches and 2) we change it within the function, so the
> compiler can't remove the branches, anyway. But perhaps I'm missing
> something.
>
> Some other random thoughts:
>
> + match = vector8_or(vector8_eq(chunk, nl),
> vector8_eq(chunk, cr));
>
> + match = vector8_or(vector8_eq(chunk, nl),
> vector8_eq(chunk, cr));
>
> Since \n and \r are well below "normal" ASCII values, I wonder if we could
> simplify these to something like
>
> match = vector8_gt(... vector with all lanes set to \r + 1 ...,
> chunk);
>
> + /* Check if we found any special characters */
> + mask = vector8_highbit_mask(match);
> + if (mask != 0)
>
> vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
> waiting until we enter the "if" block to calculate it has any benefit.
>
> + simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);
>
> If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
> place, right?
>
> + simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
> +
> + /*
> + * Do not disable SIMD when we hit EOL or EOF characters. In
> + * practice, it does not matter for EOF because parsing ends
> + * there, but we keep the behavior consistent.
> + */
> + if (!(simd_hit_eof || simd_hit_eol))
>
> I'd think that doing less unnecessary work would outweigh the benefits of
> consistency for the EOF case.
>
> --
> nathan
>
--
-- Manni Wood EDB: https://www.enterprisedb.com