| From: | Manni Wood <manni(dot)wood(at)enterprisedb(dot)com> |
|---|---|
| To: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz> |
| Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date: | 2025-11-13 02:40:35 |
| Message-ID: | CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:
> On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>
> wrote:
>
>> Hello!
>>
>> I wanted to reproduce the results using files attached by Shinya Kato and
>> Ayoub Kazar. I installed a postgres compiled from master, and then I
>> installed a postgres built from master with Nazir Bilal Yavuz's v3 patches
>> applied.
>>
>> The master+v3patches postgres naturally performed better on copying into
>> the database: anywhere from 11% better for the t.csv file produced by
>> Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
>> created by Ayoub Kazar's simd-copy-from-bench.sql.
>>
>> But here's where it gets weird. The two files created by Ayoub Kazar's
>> simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt,
>> and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
>> respectively.
>>
>> This seems impossible.
>>
>> A few things I should note:
>>
>> I timed the commands using the Unix time command, like so:
>>
>> time psql -X -U mwood -h localhost -d postgres -c '\copy t from
>> /tmp/t_4096_escape.txt'
>>
>> For each file, I timed the copy 6 times and took the average.
>>
>> This was done on my work Linux machine while also running Chrome and an
>> Open Office spreadsheet; not a dedicated machine only running postgres.
>>
> Hello,
> I think if you do a perf benchmark (if it still reproduces) it would
> probably be possible to explain why it's performing like that by looking at
> the CPI and other metrics and comparing them to my findings.
> I also suggest making the data even closer to the worst case, i.e. more
> special characters placed where they hurt the switching between SIMD and
> scalar processing (in the simd-copy-from-bench.sql file). If it still does
> a good job, then there's something to look at.
>
>>
>>
>
>> All of the copy results took between 4.5 seconds (Shinya's t.csv copied
>> into postgres compiled from master) and 2 seconds (Ayoub
>> Kazar's t_4096_none.csv copied into postgres compiled from master plus
>> Nazir's v3 patches).
>>
>> Perhaps I need to fiddle with the provided SQL to produce larger files to
>> get longer run times? Maybe sub-second differences won't tell as
>> interesting a story as minutes-long copy commands?
>>
> I did try it on a few GB (around 2-5GB only) and the differences were not
> that large, but if you can run this on more data (at least 10GB) it would
> be good to look at, although I don't expect anything interesting since the
> shape of the data is the same for the entire COPY.
>
>>
>> Thanks for reading this.
>> --
>> -- Manni Wood EDB: https://www.enterprisedb.com
>>
> Thanks for the info.
>
>
> Regards,
> Ayoub Kazar.
>
Hello again!

It looks like using 10 times the data removed the apparent speedup in the
SIMD code when the SIMD code has to deal with t_4096_escape.txt
and t_4096_quote.csv. When both files contain 1,000,000 lines each,
postgres master+v3patch imports them 0.63% slower and 0.54% slower,
respectively.

For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup.
For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.
I got these numbers just via simple timing, though this time I used psql's
\timing feature. I left psql running rather than launching it each time as
I did when I used the Unix "time" command. I ran the copy command 5 times
for each file and averaged the results. Again, this happened on a Linux
machine that also happened to be running Chrome and Open Office's
spreadsheet.
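
For anyone who wants to redo the arithmetic, here is a minimal Python sketch of how averaged timings can be turned into a "percent faster/slower" figure. It assumes the percentage is the reduction in mean elapsed time relative to master; the function name and the sample timings are hypothetical placeholders, not measurements from this thread.

```python
from statistics import mean


def percent_change(baseline_ms, patched_ms):
    """Return the change in mean elapsed time as a percentage of the baseline.

    Positive means the patched build was faster, negative means slower.
    Assumption: "speedup" is measured as reduction in elapsed time.
    """
    base = mean(baseline_ms)
    patched = mean(patched_ms)
    return (base - patched) / base * 100.0


# Placeholder timings in milliseconds (as psql's \timing would report them).
# These are made-up values for illustration only.
master_runs = [1000.0, 1020.0, 980.0, 1010.0, 990.0]
patched_runs = [800.0, 810.0, 790.0, 805.0, 795.0]

change = percent_change(master_runs, patched_runs)
label = "faster" if change >= 0 else "slower"
print(f"patched build is {abs(change):.2f}% {label}")
```

With only 5 runs on a busy desktop, differences well under 1% are probably within the noise, so it may also be worth looking at the spread of the individual runs and not just the mean.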
I should probably try to construct some .txt or .csv files that would trip
up the simd on/off heuristic in the v3 patch.
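
One way to build such a file might look like the Python sketch below. It alternates short blocks of plain rows with blocks of rows packed with quotes and delimiters, so that the "shape" of the data keeps changing. I am assuming a single text column and a row length of 4096 to match the existing file names; the function names, the block size, and the output path are hypothetical, and I have not checked what pattern actually stresses the v3 heuristic the most.

```python
import csv


def make_mixed_rows(n_rows, row_len=4096, block=8):
    """Yield single-column rows that flip between plain and special-heavy.

    Every `block` rows the generator switches between a row of plain
    characters and a row where two out of every three characters are a
    double quote or a comma.
    """
    plain = "a" * row_len
    special = ('a",' * (row_len // 3 + 1))[:row_len]
    for i in range(n_rows):
        if (i // block) % 2 == 0:
            yield [plain]
        else:
            yield [special]


def write_csv(path, n_rows, **kwargs):
    """Write the rows as CSV; csv.writer quotes and escapes special fields."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(make_mixed_rows(n_rows, **kwargs))


# Example (hypothetical path):
# write_csv("/tmp/t_4096_mixed.csv", 1_000_000)
```

Shrinking `block` makes the switches more frequent, which is presumably the interesting direction to push in.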
If data "in the wild" tend to be roughly the same "shape" from row to row,
as Andrew's experience has shown, I imagine these million row results bode
well for the v3 patch...
--
-- Manni Wood EDB: https://www.enterprisedb.com