Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>
To: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
Cc: Neil Conway <neil(dot)conway(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-02-02 14:17:55
Message-ID: CAKWEB6p5+3VL4s61=zD4UBFp4ybNo1NrBnBw+avXsxgjBREqew@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 31, 2026 at 10:21 AM KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:

> Hello,
>
> On Wed, Jan 21, 2026 at 9:50 PM Neil Conway <neil(dot)conway(at)gmail(dot)com> wrote:
>
>> A few suggestions:
>>
>> * I'm curious if we'll see better performance on large inputs if we flush
>> to `line_buf` periodically (e.g., at least every few thousand bytes or so).
>> Otherwise we might see poor data cache behavior if large inputs with no
>> control characters get evicted before we've copied them over. See the
>> approach taken in escape_json_with_len() in utils/adt/json.c
>>
> So I gave this a try. Attached is a small patch containing v3 plus this
> suggestion. Here are the results with different thresholds for the
> line_buf refill:
>
> Execution time compared to master:
> Workload   v3       v3.1 (2k)   v3.1 (4k)   v3.1 (8k)   v3.1 (16k)  v3.1 (20k)  v3.1 (28k)
> text/none  -16.5%   -17.4%      -14.3%      -12.6%      -13.6%      -10.5%      -16.3%
> text/esc   +5.6%    +11.1%      +3.1%       +7.6%       +3.0%       +4.9%       +4.2%
> csv/none   -31.0%   -29.9%      -26.7%      -30.1%      -27.9%      -30.2%      -29.6%
> csv/quote  +0.2%    -0.6%       -0.4%       -1.0%       +0.1%       +2.5%       -1.0%
>
> L1d cache miss rates:
> Workload   Master  v3     v3.1 (2k)   v3.1 (4k)   v3.1 (8k)   v3.1 (16k)  v3.1 (20k)  v3.1 (28k)
> text/none  0.20%   0.23%  0.21%       0.22%       0.21%       0.21%       0.21%       0.22%
> text/esc   0.21%   0.22%  0.22%       0.22%       0.22%       0.21%       0.22%       0.22%
> csv/none   0.17%   0.22%  0.21%       0.22%       0.21%       0.21%       0.22%       0.22%
> csv/quote  0.18%   0.22%  0.19%       0.20%       0.20%       0.19%       0.20%       0.20%
>
> On my laptop I have 32KB of L1 cache per core.
> The results are very close. The difference is hard to see in the cache-miss
> numbers, but the execution times tell a different story, so periodically
> flushing to line_buf seems worth doing.
> If Manni can rerun the benchmarks on these too, it would be nice to
> confirm this.
>
>
> Regards,
> Ayoub
>
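
For anyone following along, here is a rough, standalone sketch of the
periodic-flush idea Neil describes above. It is not the code from the v3.1
patch: `LineBuf`, `linebuf_append()` and `copy_line_periodic()` are
hypothetical names made up for illustration, the buffer is a plain realloc'd
array rather than a StringInfo, and the scan is a simple byte loop rather than
SIMD. The only point is that scanned bytes get copied out once `threshold`
bytes have accumulated, instead of waiting for the end of the line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, standing in for line_buf. */
typedef struct
{
    char   *data;
    size_t  len;
    size_t  cap;
} LineBuf;

/* Append n bytes from src, growing the buffer geometrically as needed. */
static void
linebuf_append(LineBuf *buf, const char *src, size_t n)
{
    if (buf->len + n > buf->cap)
    {
        size_t  newcap = buf->cap ? buf->cap : 64;
        char   *newdata;

        while (newcap < buf->len + n)
            newcap *= 2;
        newdata = realloc(buf->data, newcap);
        if (newdata == NULL)
            abort();
        buf->data = newdata;
        buf->cap = newcap;
    }
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
}

/*
 * Copy one line (up to, but not including, the first '\n') from input into
 * buf.  Pending bytes are flushed whenever at least `threshold` of them have
 * been scanned, so they are copied while still likely to be in cache.
 *
 * Returns the number of input bytes consumed, including the newline if one
 * was found.  *nflushes is set to the number of append calls made.
 */
static size_t
copy_line_periodic(const char *input, size_t input_len,
                   LineBuf *buf, size_t threshold, int *nflushes)
{
    size_t  start = 0;      /* first byte not yet copied to buf */
    size_t  pos = 0;        /* current scan position */

    assert(threshold > 0);
    *nflushes = 0;

    while (pos < input_len && input[pos] != '\n')
    {
        pos++;
        if (pos - start >= threshold)
        {
            /* Periodic flush: do not let too many scanned bytes pile up. */
            linebuf_append(buf, input + start, pos - start);
            start = pos;
            (*nflushes)++;
        }
    }

    /* Final flush of whatever is left before the newline or end of input. */
    if (pos > start)
    {
        linebuf_append(buf, input + start, pos - start);
        (*nflushes)++;
    }

    /* Consume the newline itself, if present. */
    if (pos < input_len)
        pos++;

    return pos;
}
```

With a 10000-byte line and a 4096-byte threshold, this does two periodic
flushes plus one final flush. The thresholds in Ayoub's table (2k to 28k)
would correspond to the `threshold` argument here.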

Hello, All!

Ayoub, I will try to benchmark v3.1 this week on my standalone x86 and Arm
PCs. Sadly, other work has taken priority over the last couple of weeks,
but I will carve out some time.

Neil, thanks so much for looking at this patch!

-Manni
--
-- Manni Wood EDB: https://www.enterprisedb.com
