Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>
To: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2025-11-13 02:40:35
Message-ID: CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:

> On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>
> wrote:
>
>> Hello!
>>
>> I wanted to reproduce the results using files attached by Shinya Kato and
>> Ayoub Kazar. I installed a postgres compiled from master, and then I
>> installed a postgres built from master with Nazir Bilal Yavuz's v3 patches
>> applied.
>>
>> The master+v3patches postgres naturally performed better on copying into
>> the database: anywhere from 11% better for the t.csv file produced by
>> Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
>> created by Ayoub Kazar's simd-copy-from-bench.sql.
>>
>> But here's where it gets weird. The two files created by Ayoub Kazar's
>> simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt,
>> and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
>> respectively.
>>
>> This seems impossible.
>>
>> A few things I should note:
>>
>> I timed the commands using the Unix time command, like so:
>>
>> time psql -X -U mwood -h localhost -d postgres -c '\copy t from
>> /tmp/t_4096_escape.txt'
>>
>> For each file, I timed the copy 6 times and took the average.
>>
>> This was done on my work Linux machine while also running Chrome and an
>> Open Office spreadsheet; not a dedicated machine only running postgres.
>>
> Hello,
> I think if you run a perf benchmark (if it still reproduces), it would
> probably be possible to explain why it performs like that by looking at
> the CPI and other metrics and comparing them to my findings.
> What I also suggest is to make the data even closer to the worst
> case, i.e. more special characters, where switching between SIMD
> and scalar processing hurts (in the simd-copy-from-bench.sql file). If it
> still does a good job, then there's something to look at.
>
>> All of the copy runs took between 4.5 seconds (Shinya's t.csv copied
>> into postgres compiled from master) and 2 seconds (Ayoub
>> Kazar's t_4096_none.csv copied into postgres compiled from master plus
>> Nazir's v3 patches).
>>
>> Perhaps I need to fiddle with the provided SQL to produce larger files to
>> get longer run times? Maybe sub-second differences won't tell as
>> interesting a story as minutes-long copy commands?
>>
> I did try it on a few GB (around 2-5 GB only); the differences were not
> that large. If you can run this on more data (at least 10 GB), it would be
> good to look at, although I don't expect anything interesting since the
> shape of the data is the same for the whole COPY.
>
>>
>> Thanks for reading this.
>> --
>> -- Manni Wood EDB: https://www.enterprisedb.com
>>
> Thanks for the info.
>
>
> Regards,
> Ayoub Kazar.
>

Hello again!

It looks like using 10 times the data removed the apparent speedup in the
SIMD code when it has to deal with t_4096_escape.txt and t_4096_quote.csv.
When both files contain 1,000,000 lines each, postgres master+v3patch
imports them 0.63% slower and 0.54% slower, respectively.
For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup.
For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.

I got these numbers via simple timing, though this time I used psql's
\timing feature. I left psql running rather than launching it each time as
I did when I used the Unix "time" command. I ran the copy command 5 times
for each file and averaged the results. Again, this was on a Linux
machine that was also running Chrome and an Open Office spreadsheet.
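
For reference, a minimal sketch of how such percentages can be derived from
repeated \timing readings. It assumes the percentage is the reduction in mean
elapsed time relative to master (the message does not say which definition
was used), and the function name and the numbers in the usage comment are
placeholders, not measurements from this thread.

```python
from statistics import mean


def percent_faster(master_ms, patched_ms):
    """Percent reduction in mean elapsed time of the patched build vs. master.

    Positive means the patched build is faster, negative means slower.
    Both arguments are lists of per-run timings in milliseconds.
    """
    base = mean(master_ms)
    patched = mean(patched_ms)
    return (base - patched) * 100.0 / base


# Usage with placeholder timings (five runs per build, as in the message):
#   percent_faster([1000, 1010, 990, 1005, 995], [700, 705, 695, 702, 698])
```

With only five runs on a busy desktop, differences well under 1% are likely
within run-to-run noise, so reporting the spread alongside the mean would make
the comparison easier to judge.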

I should probably try to construct some .txt or .csv files that would trip
up the SIMD on/off heuristic in the v3 patch.
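
One possible way to build such a file, as a sketch only: alternate blocks of
rows with no special characters and blocks of rows dense in quote characters,
so that any per-row or per-block decision has to keep flipping. The message
does not describe how the v3 heuristic decides, so the block pattern, the
4096-character width, the single-column layout, and all names below are
assumptions.

```python
import csv
import random


def make_mixed_rows(n_rows, width=4096, block=64, seed=0):
    """Yield single-field rows, alternating every `block` rows between
    plain rows (only 'a') and rows where about half the characters are '"'."""
    rng = random.Random(seed)
    for i in range(n_rows):
        if (i // block) % 2 == 0:
            # Plain block: nothing for the parser to treat specially.
            yield ["a" * width]
        else:
            # Dense block: random mix of quote characters and letters.
            chars = ['"' if rng.random() < 0.5 else "a" for _ in range(width)]
            yield ["".join(chars)]


def write_csv(f, n_rows, **kwargs):
    """Stream the rows to an open text file as CSV.

    csv.writer quotes any field containing '"' and doubles the embedded
    quotes, which is the convention COPY ... (FORMAT csv) expects by default.
    """
    writer = csv.writer(f, lineterminator="\n")
    writer.writerows(make_mixed_rows(n_rows, **kwargs))


# Usage (hypothetical output path):
#   with open("/tmp/t_4096_mixed.csv", "w", newline="") as f:
#       write_csv(f, 1_000_000)
```

Varying `block` (for example 1, 8, 64, 1024) would show whether the cost
depends on how often the data changes shape.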

If data "in the wild" tend to be roughly the same "shape" from row to row,
as Andrew's experience has shown, I imagine these million row results bode
well for the v3 patch...
--
-- Manni Wood EDB: https://www.enterprisedb.com
