Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2025-10-21 18:40:09
Message-ID: aPfTiX0HwV42R6Od@nathan
Lists: pgsql-hackers

On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> I think the problem is deciding how many lines to process before
> deciding for the rest. 1000 lines could work for small-sized data,
> but it might not work for big-sized data. Also, it might cause worse
> regressions for small-sized data.

IMHO we have some leeway with smaller amounts of data. If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much. (Those numbers are
completely made up in order to illustrate my point.)

> For this reason, I
> tried to implement a heuristic that will work regardless of the size
> of the data. The last heuristic I suggested will run SIMD for
> approximately (#number_of_lines / 1024) lines if all characters in
> the data are special characters, where 1024 is the maximum number of
> lines to skip before trying SIMD again.

I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.
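
To make that concrete, here is a rough, untested sketch of what spacing
out the checks with a multiplicative backoff could look like (the names
SimdBackoff, simd_should_try, and simd_report are made up for
illustration; this is not the patch's actual code):

/*
 * Hypothetical backoff state for retrying the SIMD path after it has
 * proven unprofitable on recent lines.
 */
#include <stdbool.h>

#define MAX_BACKOFF_LINES 1024	/* cap on lines skipped between retries */

typedef struct SimdBackoff
{
	int		skip;		/* lines to skip before the next SIMD attempt */
	int		remaining;	/* lines left to skip in the current interval */
} SimdBackoff;

/* Returns true if this line should attempt the SIMD path. */
static bool
simd_should_try(SimdBackoff *b)
{
	if (b->remaining > 0)
	{
		b->remaining--;
		return false;
	}
	return true;
}

/*
 * Call after a SIMD attempt: on success, reset the interval so SIMD
 * runs on every line; on failure (e.g., too many special characters),
 * double the interval up to the cap.
 */
static void
simd_report(SimdBackoff *b, bool profitable)
{
	if (profitable)
		b->skip = 0;
	else
	{
		b->skip = (b->skip == 0) ? 1 : b->skip * 2;
		if (b->skip > MAX_BACKOFF_LINES)
			b->skip = MAX_BACKOFF_LINES;
	}
	b->remaining = b->skip;
}

With the cap at 1024, pathological input that always fails the SIMD
check settles into retrying it roughly once every 1024 lines, which
lines up with the (#number_of_lines / 1024) figure quoted above, while
clean input keeps the SIMD path on every line.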

--
nathan
