Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, KAZAR Ayoub <ma_kazar(at)esi(dot)dz>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: 2025-10-22 19:24:59
Message-ID: aPkvi5P7kpA8oQKc@nathan
Lists: pgsql-hackers

On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>> I wonder if we could mitigate the regression further by spacing out the
>> checks a bit more. It could be worth comparing a variety of values to
>> identify what works best with the test data.
>
> Do you mean that instead of doubling the SIMD sleep, we should
> multiply it by 3 (or another factor)? Or are you referring to
> increasing the maximum sleep from 1024? Or possibly both?

I'm not sure of the precise details, but the main thrust of my suggestion
is to assume that whatever sampling you do to determine whether to use SIMD
is good for a larger chunk of data. That is, if you are sampling 1K lines
and then using the result to choose whether to use SIMD for the next 100K
lines, we could instead bump the latter number to 1M lines (or something).
That way we minimize the regression for relatively uniform data sets while
retaining some ability to adapt in case things change halfway through a
large table.
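
To make the idea concrete, here is a minimal sketch in C, not code from the patch. It samples a small window of lines, decides once, and then holds that decision for a much longer window before sampling again. All names (SimdChooser, chooser_init, chooser_note_line) and the constants are hypothetical, and the decision rule (average bytes per special character) is only an assumed stand-in for whatever heuristic the patch actually uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical knobs; the first two mirror the numbers discussed above. */
#define SAMPLE_LINES          1000      /* lines inspected per decision */
#define APPLY_LINES           1000000   /* lines a decision stays in force */
#define MIN_BYTES_PER_SPECIAL 16        /* assumed threshold, not measured */

typedef struct SimdChooser
{
    bool     use_simd;         /* decision currently in force */
    bool     sampling;         /* true while a sample is being collected */
    uint64_t lines_left;       /* lines remaining in the current phase */
    uint64_t sample_bytes;     /* bytes seen in the current sample */
    uint64_t sample_specials;  /* special characters seen in the sample */
} SimdChooser;

/* Start in the sampling phase, on the scalar path (an assumption). */
static void
chooser_init(SimdChooser *c)
{
    c->use_simd = false;
    c->sampling = true;
    c->lines_left = SAMPLE_LINES;
    c->sample_bytes = 0;
    c->sample_specials = 0;
}

/*
 * Record one parsed line and return whether the next line should take the
 * SIMD path.  While re-sampling, the previous decision stays in force.
 */
static bool
chooser_note_line(SimdChooser *c, size_t line_bytes, size_t special_chars)
{
    if (c->sampling)
    {
        c->sample_bytes += line_bytes;
        c->sample_specials += special_chars;
    }

    if (--c->lines_left == 0)
    {
        if (c->sampling)
        {
            /* Sample complete: decide once, then hold for APPLY_LINES. */
            c->use_simd = c->sample_bytes >=
                (uint64_t) MIN_BYTES_PER_SPECIAL * c->sample_specials;
            c->sampling = false;
            c->lines_left = APPLY_LINES;
        }
        else
        {
            /* Hold period over: start a fresh sample. */
            c->sampling = true;
            c->sample_bytes = 0;
            c->sample_specials = 0;
            c->lines_left = SAMPLE_LINES;
        }
    }

    return c->use_simd;
}
```

With these values the sampling work covers about 0.1% of the lines, so its cost on uniform data should be small, while the decision can still flip at the next sample if the data changes partway through the load. The right ratio would have to come from benchmarking against the test data.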

--
nathan
