Re: Speed up COPY FROM text/CSV parsing using SIMD

From: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
To: Neil Conway <neil(dot)conway(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-01-31 16:20:58
Message-ID: CA+K2Ru=C_woAnd-3-pGHoNSTR8FOf=7eeSWE1xaLt9ojVWndVg@mail.gmail.com
Lists: pgsql-hackers

Hello,

On Wed, Jan 21, 2026 at 9:50 PM Neil Conway <neil(dot)conway(at)gmail(dot)com> wrote:

> A few suggestions:
>
> * I'm curious if we'll see better performance on large inputs if we flush
> to `line_buf` periodically (e.g., at least every few thousand bytes or so).
> Otherwise we might see poor data cache behavior if large inputs with no
> control characters get evicted before we've copied them over. See the
> approach taken in escape_json_with_len() in utils/adt/json.c

So I gave this a try. Attached is a small patch containing v3 plus the
suggestion. Here are the results with different thresholds for the
line_buf refill:

Execution time compared to master:

Workload    v3       v3.1 (2k)  v3.1 (4k)  v3.1 (8k)  v3.1 (16k)  v3.1 (20k)  v3.1 (28k)
text/none   -16.5%   -17.4%     -14.3%     -12.6%     -13.6%      -10.5%      -16.3%
text/esc    +5.6%    +11.1%     +3.1%      +7.6%      +3.0%       +4.9%       +4.2%
csv/none    -31.0%   -29.9%     -26.7%     -30.1%     -27.9%      -30.2%      -29.6%
csv/quote   +0.2%    -0.6%      -0.4%      -1.0%      +0.1%       +2.5%       -1.0%

L1d cache miss rates:

Workload    Master   v3      v3.1 (2k)  v3.1 (4k)  v3.1 (8k)  v3.1 (16k)  v3.1 (20k)  v3.1 (28k)
text/none   0.20%    0.23%   0.21%      0.22%      0.21%      0.21%       0.21%       0.22%
text/esc    0.21%    0.22%   0.22%      0.22%      0.22%      0.21%       0.22%       0.22%
csv/none    0.17%    0.22%   0.21%      0.22%      0.21%      0.21%       0.22%       0.22%
csv/quote   0.18%    0.22%   0.19%      0.20%      0.20%      0.19%       0.20%       0.20%

My laptop has 32KB of L1 cache per core.
The results are very close. The difference is hard to see in the cache-miss
numbers, but the execution times tell a somewhat different story, and
periodically refilling line_buf seems worth doing.
If Manni can rerun the benchmarks with these thresholds too, it would be
nice to confirm this.
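
For anyone following along without the patch open, the periodic-flush idea can be sketched roughly as below. This is not the code from the attached patch: SketchBuf, sketch_append, scan_with_periodic_flush and FLUSH_THRESHOLD are hypothetical names, the byte-at-a-time loop stands in for the SIMD scan, and the fixed-size destination stands in for a StringInfo that real code would grow (for example via appendBinaryStringInfo).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a StringInfo-like buffer; not the real API. */
typedef struct SketchBuf
{
    char   *data;   /* destination storage, assumed large enough */
    size_t  len;    /* bytes appended so far */
} SketchBuf;

/* Hypothetical threshold, in the range of the values benchmarked above. */
#define FLUSH_THRESHOLD 4096

/* Append n bytes. The caller guarantees capacity; real code would enlarge. */
static void
sketch_append(SketchBuf *dst, const char *src, size_t n)
{
    memcpy(dst->data + dst->len, src, n);
    dst->len += n;
}

/*
 * Scan src for the terminator byte. Instead of copying everything once at
 * the end, copy the pending bytes to dst each time FLUSH_THRESHOLD of them
 * have accumulated, so they are likely still in cache when copied.
 * Returns the number of bytes consumed (terminator excluded) and counts
 * the copies in *flushes for illustration.
 */
static size_t
scan_with_periodic_flush(SketchBuf *dst, const char *src, size_t srclen,
                         char term, int *flushes)
{
    size_t  copied = 0;     /* bytes of src already appended to dst */
    size_t  i;

    for (i = 0; i < srclen; i++)
    {
        if (src[i] == term)
            break;

        /* Byte i is ordinary data; flush once enough bytes are pending. */
        if (i + 1 - copied >= FLUSH_THRESHOLD)
        {
            sketch_append(dst, src + copied, i + 1 - copied);
            copied = i + 1;
            (*flushes)++;
        }
    }

    /* Copy whatever is left before the terminator (or end of input). */
    if (i > copied)
    {
        sketch_append(dst, src + copied, i - copied);
        (*flushes)++;
    }
    return i;
}
```

With a 4096-byte threshold, a 9000-byte run without control characters is copied in three pieces (4096, 4096 and 808 bytes) rather than in one copy at the end.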

Regards,
Ayoub

Attachment: 0001-COPY-from-SIMD-v3-with-line_buf-periodic-refill.patch (application/x-patch, 8.5 KB)
