Re: Speed up COPY FROM text/CSV parsing using SIMD

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date:	2025-10-20 20:31:58
Message-ID:	8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net
Lists:	pgsql-hackers

On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
> On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
>> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>>> With this heuristic the regression is limited by %2 in the worst case.
>> My worry is that the worst case is actually quite common. Sparse data sets
>> dominated by a lot of null values (and hence lots of special characters) are
>> very common. Are people prepared to accept a 2% regression on load times for
>> such data sets?
> Without knowing how common it is, I think it's difficult to judge whether
> 2% is a reasonable trade-off. If <5% of workloads might see a small
> regression while the other >95% see double-digit percentage improvements,
> then I might argue that it's fine. But I'm not sure we have any way to
> know those sorts of details at the moment.

I guess what I don't understand is why we actually need to do the test
continuously, even using an adaptive algorithm. Data files in my
experience usually have lines with fairly similar shapes. It's highly
unlikely that you will get the first 1000 (say) lines of a file that
are rich in special characters and then some later significant section
that isn't, or vice versa. Therefore, doing the test once should yield
the correct answer that can be applied to the rest of the file. That
should reduce the worst case regression to ~0% without sacrificing any
of the performance gains. I appreciate the elegance of what Bilal has
done here, but it does seem like overkill.
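
To make the "test once" idea concrete, here is a minimal sketch of what it could look like. It is not taken from Bilal's patch: every name (CopySampleState, copy_sample_line, SAMPLE_LINES, SPECIAL_PCT_LIMIT) is hypothetical, the 5% cutoff is an arbitrary placeholder that would need benchmarking, and using the scalar path while the sample is still being collected is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names and thresholds here are hypothetical, not from the patch. */
#define SAMPLE_LINES       1000	/* "the first 1000 (say) lines" */
#define SPECIAL_PCT_LIMIT  5	/* assumed cutoff; would need benchmarking */

typedef struct CopySampleState
{
	size_t		lines_seen;
	size_t		bytes_seen;
	size_t		specials_seen;
	bool		decided;		/* true once the sample is complete */
	bool		use_simd;		/* stays false until decided */
} CopySampleState;

/* Characters that would interrupt a SIMD scan over plain data. */
static bool
is_special(char c, char delim, char quote, char escape)
{
	return c == delim || c == quote || c == escape || c == '\n' || c == '\r';
}

/*
 * Account for one input line. Returns true if the SIMD path should be used
 * for subsequent lines. While the sample is incomplete this returns false
 * (scalar path), which is an assumption of this sketch.
 */
static bool
copy_sample_line(CopySampleState *st, const char *line, size_t len,
				 char delim, char quote, char escape)
{
	assert(st != NULL);

	if (st->decided)
		return st->use_simd;	/* decision made: no further testing */

	for (size_t i = 0; i < len; i++)
	{
		if (is_special(line[i], delim, quote, escape))
			st->specials_seen++;
	}
	st->bytes_seen += len;
	st->lines_seen++;

	if (st->lines_seen >= SAMPLE_LINES)
	{
		/* Decide once, using integer math to avoid floating point. */
		st->decided = true;
		st->use_simd =
			st->specials_seen * 100 < st->bytes_seen * SPECIAL_PCT_LIMIT;
	}
	return st->use_simd;
}
```

A caller would zero-initialize the state at the start of the COPY, call copy_sample_line() for each line, and after the first SAMPLE_LINES lines the function reduces to a single branch on st->decided, which is where the ~0% worst case would come from.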

> I'm also at least a little skeptical about the 2% number. IME that's
> generally within the noise range and can vary greatly between machines and
> test runs.
>

Fair point.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
