Re: Speed up COPY FROM text/CSV parsing using SIMD

From: "Greg Burd" <greg(at)burd(dot)me>
To: "Nathan Bossart" <nathandbossart(at)gmail(dot)com>, "Nazir Bilal Yavuz" <byavuz81(at)gmail(dot)com>
Cc: "Manni Wood" <manni(dot)wood(at)enterprisedb(dot)com>, "KAZAR Ayoub" <ma_kazar(at)esi(dot)dz>, "Neil Conway" <neil(dot)conway(at)gmail(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Shinya Kato" <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2026-03-13 15:29:35
Message-ID: 43de48dc-701b-4735-881b-50bca6870f39@app.fastmail.com
Lists: pgsql-hackers


On Fri, Mar 13, 2026, at 10:05 AM, Nathan Bossart wrote:
> On Fri, Mar 13, 2026 at 04:34:49PM +0300, Nazir Bilal Yavuz wrote:
>> On Fri, 13 Mar 2026 at 14:57, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
>>> Unfortunately, v15 causes a regression for a 'csv & wide & 1/3' case
>>> on my end. v14 was taking 8000ms but v15 took ~9100ms. If we add the
>>> tmp_hit_eof variable then the regression disappears. Also, if I use a
>>> struct like below, regression disappears again.
>>
>>> When I removed the tmp_hit_eof variable on v14, I didn't encounter any
>>> regression. I really don't understand why this is happening on my end.
>>> Manni didn't encounter any regression on the benchmark [1].
>>
>> Problem might be related to gcc. I am using Debian Trixie and my
>> current gcc version is 'gcc version 14.2.0 (Debian 14.2.0-19)'. If I
>> compile Postgres with 'Debian clang version 19.1.7 (3+b1)', then there
>> is no regression, which makes more sense IMO.
>
> Let's just re-add the temporary variable for hit_eof. The struct idea is
> clever, but it's just a little more complicated than I think is necessary
> here.
>
> I've also removed the goto in favor of just duplicating the "out" code,
> like you had before. I'd like to avoid sporadic #ifndef USE_NO_SIMD uses,
> and goto is out of fashion, anyway.

Hey Nathan, Nazir, et al.,

I've always been a fan of these kinds of optimizations, so I couldn't resist reviewing, but I know you're ready to commit, so I'll just check on some systems I have. :)

At first glance the implementation seems conservative, but correct and safe. Local testing on Linux/FreeBSD x86_64 and Win11/aarch64/MSVC looks good. I also tried illumos/SPARCv9 with some build-system fixes (from another active thread), and it worked just fine too. I'm sure the 10 people who care will be thrilled. ;-P

I ran into the "out:" label not being defined, because USE_NO_SIMD wasn't defined, when I tested v15 on Linux/RISC-V with my buildfarm animal "greenfly", an Orange Pi RV2 running Linux 24.04.4 LTS. It looks like this morning's changes have fixed that.
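
For readers following along, the shape discussed above (duplicating the exit code rather than jumping to a label that only exists under one preprocessor branch) can be sketched roughly as follows. This is an illustration only, not code from the patch: scan_for_newline, the found flag, and the 16-byte chunk size are hypothetical, and the inner byte loop merely stands in for a real vector compare.

```c
#include <stddef.h>

/*
 * Hypothetical scanner, NOT the patch's code.  Compile with -DUSE_NO_SIMD
 * to drop the chunked path; the function builds either way because no
 * branch jumps to a label that lives inside a preprocessor conditional.
 */
static size_t
scan_for_newline(const char *buf, size_t len, int *found)
{
    size_t      i = 0;

#ifndef USE_NO_SIMD
    /* Chunked fast path: whole 16-byte blocks only. */
    for (; i + 16 <= len; i += 16)
    {
        /* Stand-in for a vector compare over one chunk. */
        for (size_t j = 0; j < 16; j++)
        {
            if (buf[i + j] == '\n')
            {
                /* Exit code duplicated here instead of "goto out". */
                *found = 1;
                return i + j;
            }
        }
    }
#endif

    /* Scalar path: the whole buffer, or the tail left by the chunked loop. */
    for (; i < len; i++)
    {
        if (buf[i] == '\n')
        {
            *found = 1;
            return i;
        }
    }

    *found = 0;
    return len;
}
```

The cost is a couple of repeated lines; the benefit is that both build configurations see the same control flow.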

I would have reported this earlier this morning, but I detoured into RISC-V-specific fixes and optimizations. Interestingly, simd.h uses a fixed-width Vector8 type (16 bytes on SSE2/NEON) and relies on sizeof(Vector8) throughout. However, RISC-V vector types like vuint8m1_t are "sizeless", meaning that you cannot use sizeof() on them. Modeling that properly would require overhauling all of simd.h, which is way out of scope for this patch, so I'll start a new thread for that (and CRC, and popcnt). I also created a few tests (attached) to check boundary conditions; I might add some along with the RISC-V work.
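
To make the fixed-width versus sizeless distinction concrete, here is a minimal sketch in portable C. It is an illustration under stated assumptions, not code from simd.h or the patch: FixedVec8, find_byte_fixed, find_byte_scalable, and hw_vlen are hypothetical names, the inner byte loops stand in for real vector compares, and hw_vlen stands in for the vector length a scalable ISA would report at run time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-width vector: 16 bytes, as on SSE2/NEON. */
typedef struct
{
    uint8_t     b[16];
} FixedVec8;

/* Fixed-width style: the stride is a compile-time constant from sizeof(). */
static size_t
find_byte_fixed(const uint8_t *buf, size_t len, uint8_t c)
{
    size_t      i = 0;

    for (; i + sizeof(FixedVec8) <= len; i += sizeof(FixedVec8))
    {
        FixedVec8   chunk;

        memcpy(&chunk, buf + i, sizeof(chunk));     /* "vector load" */
        for (size_t j = 0; j < sizeof(chunk); j++)  /* "vector compare" */
            if (chunk.b[j] == c)
                return i + j;
    }

    for (; i < len; i++)        /* scalar tail */
        if (buf[i] == c)
            return i;

    return len;                 /* not found */
}

/*
 * Scalable style: the chunk width is only known at run time, so nothing
 * can be sized with sizeof().  hw_vlen (assumed > 0) plays the role of the
 * hardware-reported vector length; the last chunk simply shrinks, so no
 * separate scalar tail is needed.
 */
static size_t
find_byte_scalable(const uint8_t *buf, size_t len, uint8_t c, size_t hw_vlen)
{
    size_t      i = 0;

    while (i < len)
    {
        size_t      vl = (len - i < hw_vlen) ? len - i : hw_vlen;

        for (size_t j = 0; j < vl; j++)     /* "vector compare" on vl bytes */
            if (buf[i + j] == c)
                return i + j;
        i += vl;
    }

    return len;                 /* not found */
}
```

The first loop bakes the chunk width into the code through sizeof(), which is why a type without a compile-time size does not drop into an interface written that way; the second takes the width as a run-time value instead.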

Nice work, LGTM.

-greg

> --
> nathan
>
> Attachments:
> * v17-0001-Optimize-COPY-FROM-FORMAT-text-csv-using-SIMD.patch

Attachments:
* test_simd_copy_boundaries.sql (application/sql, 7.2 KB)
