Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc:	KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2026-02-11 13:27:50
Message-ID:	CAN55FZ1=O6TjeZM2CUT7T2tu66uJT+w3G9FiRXVs+gt_ousFxQ@mail.gmail.com
Lists:	pgsql-hackers

Hi,

On Sat, 7 Feb 2026 at 01:47, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Sat, Feb 07, 2026 at 01:19:16AM +0300, Nazir Bilal Yavuz wrote:
> > I have three possible approaches in my mind, they are actually similar
> > to each other.
> >
> > 1- After encountering a special character, disable SIMD for the rest
> > of the current line and also for the rest of the data.
> >
> > 2- It is a mixed version of the current heuristic and #1. After
> > encountering a special character, skip SIMD for the current line (let's
> > say line 1) and for the next line (line 2). Then try running SIMD for
> > the next line (line 3), if there is no special character continue to
> > run SIMD but if there is a special character then skip running SIMD
> > for two lines this time. And it goes like that, every time a special
> > character is encountered in the SIMD run, skipped SIMD lines are
> > doubled.
> >
> > 3- This version is a bit different from #2. Instead of calculating the
> > number of lines to skip dynamically, skip a constant number N of
> > lines and then try to run SIMD again after these lines. N could be
> > something like 100, 1000, or 10000, etc. Actually, you and Andrew
> > suggested this approach before [1].
> >
> > I think what you suggested is closer to #1 or #3. I just wanted to
> > hear your opinions, and whether you think any of these approaches are
> > good to implement / work on.
>
> Yeah, I think either (1) or (3) would be a good starting point. (1) is
> basically just (3) with N set to infinity, anyway. I imagine there's some
> value less than infinity that is acceptable, but if I had to pick an
> approach right now, I'd probably go with (1) to essentially remove the
> heuristic from the discussion until we're ready to focus on it.

I am sharing a v6 which implements (1). My benchmark results show
almost no difference for the special-character cases and a nice
improvement for the no-special-character cases.
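
To make the three heuristics quoted above concrete, here is a minimal
sketch of them as one per-line state machine. It is not taken from the
v6 patch: SimdPolicy, SimdState and the simd_* functions are
hypothetical names used only for illustration, and the sketch assumes
the decision is made once at the start of each line.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this comes from the v6 patch. */
typedef enum
{
	SIMD_OFF_FOREVER,		/* approach (1): never retry SIMD */
	SIMD_BACKOFF_DOUBLE,	/* approach (2): skip 1, 2, 4, ... lines */
	SIMD_SKIP_FIXED			/* approach (3): always skip N lines */
} SimdPolicy;

typedef struct
{
	SimdPolicy	policy;
	uint64_t	fixed_skip;		/* N for approach (3) */
	uint64_t	next_skip;		/* lines to skip on the next hit, (2) */
	uint64_t	lines_to_skip;	/* full lines still to parse without SIMD */
	bool		disabled;		/* set permanently by approach (1) */
} SimdState;

static void
simd_state_init(SimdState *st, SimdPolicy policy, uint64_t fixed_skip)
{
	st->policy = policy;
	st->fixed_skip = fixed_skip;
	st->next_skip = 1;
	st->lines_to_skip = 0;
	st->disabled = false;
}

/* Call at the start of each line: should this line use the SIMD path? */
static bool
simd_should_try(SimdState *st)
{
	if (st->disabled)
		return false;
	if (st->lines_to_skip > 0)
	{
		st->lines_to_skip--;
		return false;
	}
	return true;
}

/*
 * Call when the SIMD path meets a special character.  The caller is
 * assumed to finish the current line with the scalar path.
 */
static void
simd_saw_special(SimdState *st)
{
	switch (st->policy)
	{
		case SIMD_OFF_FOREVER:
			st->disabled = true;
			break;
		case SIMD_BACKOFF_DOUBLE:
			st->lines_to_skip = st->next_skip;
			/* Double the penalty, guarding against overflow. */
			if (st->next_skip <= UINT64_MAX / 2)
				st->next_skip *= 2;
			break;
		case SIMD_SKIP_FIXED:
			st->lines_to_skip = st->fixed_skip;
			break;
	}
}
```

In this sketch (1) behaves like (3) with N set to infinity, as noted in
the quoted reply. Whether (2) should ever reset its penalty after a
clean line is not specified above, so the sketch never resets it.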

Timing results after running Manni's v1.2.1 benchmark:

+---------+--------------+---------------+--------------+--------------+
|         | text, no sp. | text, 1/3 sp. | csv, no sp.  | csv, 1/3 sp. |
+---------+--------------+---------------+--------------+--------------+
| master  | 104437       | 118711        | 121173       | 151589       |
+---------+--------------+---------------+--------------+--------------+
| patched | 90062 -13.7% | 119070 +0.3%  | 88964 -26.5% | 153710 +1.4% |
+---------+--------------+---------------+--------------+--------------+

In case the table does not render well in your email client, here is a
short summary:

- Text, no special characters: 13.7% faster
- Text, 1/3 special characters: 0.3% slower, no meaningful change

- CSV, no special characters: 26.5% faster
- CSV, 1/3 special characters: 1.4% slower, no meaningful change
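
For anyone re-deriving the summary, the figures are the relative change
of the patched timing against master. A minimal sketch of that
arithmetic follows; pct_change is a hypothetical helper, not part of
the benchmark script.

```c
/*
 * Relative change of the patched timing versus master, in percent.
 * Negative means the patched build took less time.
 */
static double
pct_change(double master, double patched)
{
	return (patched - master) / master * 100.0;
}
```

With the timings above, pct_change(104437, 90062) is about -13.76 and
pct_change(121173, 88964) is about -26.58, while the 1/3
special-character cases come to about +0.30 (text) and +1.40 (CSV).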

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment	Content-Type	Size
v6-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch	text/x-patch	8.0 KB

In response to

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2026-02-06 22:47:18 from Nathan Bossart

Responses

Re: Speed up COPY FROM text/CSV parsing using SIMD at 2026-02-11 22:39:43 from Nathan Bossart
