From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2025-10-20 21:09:27
Message-ID: CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com
Lists: pgsql-hackers
Hi,
On Mon, 20 Oct 2025 at 23:32, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
> On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
>
> On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
>
> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>
> With this heuristic the regression is limited to 2% in the worst case.
>
> My worry is that the worst case is actually quite common. Sparse data sets
> dominated by a lot of null values (and hence lots of special characters) are
> very common. Are people prepared to accept a 2% regression on load times for
> such data sets?
>
> Without knowing how common it is, I think it's difficult to judge whether
> 2% is a reasonable trade-off. If <5% of workloads might see a small
> regression while the other >95% see double-digit percentage improvements,
> then I might argue that it's fine. But I'm not sure we have any way to
> know those sorts of details at the moment.
>
>
> I guess what I don't understand is why we actually need to do the test continuously, even using an adaptive algorithm. Data files in my experience usually have lines with fairly similar shapes. It's highly unlikely that the first 1000 (say) lines of a file will be rich in special characters and then some later significant section won't be, or vice versa. Therefore, doing the test once should yield the correct answer that can be applied to the rest of the file. That should reduce the worst case regression to ~0% without sacrificing any of the performance gains. I appreciate the elegance of what Bilal has done here, but it does seem like overkill.
I think the problem is deciding how many lines to sample before
settling on a strategy for the rest. 1000 lines could work for small
data sets, but it might not be representative for big ones, and
sampling that many lines could itself cause a worse regression on
small data sets. For this reason, I tried to implement a heuristic
that works regardless of the size of the data. The last heuristic I
suggested will run SIMD on approximately #number_of_lines / 1024
lines (1024 being the maximum number of lines to sleep before running
SIMD again), even if every character in the data is a special
character.
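
To make the shape of that heuristic concrete, here is a simplified
sketch of the kind of exponential back-off I mean; the names
(SpecialCharState, use_simd_for_line, the wasted-probe flag) are
illustrative only, not the actual patch:

    #include <stdbool.h>

    #define MAX_SLEEP_LINES 1024    /* cap on lines skipped between SIMD probes */

    typedef struct SpecialCharState
    {
        int     sleep_lines;    /* lines left before trying SIMD again */
        int     backoff;        /* current back-off, doubles up to the cap */
    } SpecialCharState;

    /* Decide whether the next line should be scanned with SIMD. */
    static bool
    use_simd_for_line(SpecialCharState *state, bool last_probe_wasted)
    {
        if (state->sleep_lines > 0)
        {
            state->sleep_lines--;
            return false;       /* still backing off: use the scalar path */
        }

        if (last_probe_wasted)
        {
            /* SIMD hit special characters immediately: back off harder. */
            if (state->backoff == 0)
                state->backoff = 1;
            else if (state->backoff < MAX_SLEEP_LINES)
                state->backoff *= 2;
            state->sleep_lines = state->backoff;
        }
        else
            state->backoff = 0; /* SIMD paid off: probe every line again */

        return true;
    }

Because the back-off saturates at 1024, even pathological input where
every byte is a special character pays for a SIMD probe only about
once per 1024 lines, which is what bounds the worst-case regression.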
--
Regards,
Nazir Bilal Yavuz
Microsoft