Re: Speed up COPY TO text/CSV parsing using SIMD

From: KAZAR Ayoub <ma_kazar(at)esi(dot)dz>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Neil Conway <neil(dot)conway(at)gmail(dot)com>, Manni Wood <manni(dot)wood(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: Speed up COPY TO text/CSV parsing using SIMD
Date: 2026-03-24 00:16:37
Message-ID: CA+K2Runt9Pfst31BBmX9pya-2AwvxgoRwZQP-PcEyg4Hoejbug@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 18, 2026 at 3:29 AM KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:

> On Wed, Mar 18, 2026 at 12:02 AM KAZAR Ayoub <ma_kazar(at)esi(dot)dz> wrote:
>
>> On Tue, Mar 17, 2026 at 7:49 PM Nathan Bossart <nathandbossart(at)gmail(dot)com>
>> wrote:
>>
>>> On Sat, Mar 14, 2026 at 11:43:38PM +0100, KAZAR Ayoub wrote:
>>> > Just a small concern about cases where a varlena has a larger binary
>>> > size than its text representation, e.g.:
>>> > SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
>>> > pg_column_size
>>> > ----------------
>>> > 32
>>> >
>>> > Its text representation is shorter than sizeof(Vector8), so currently v3
>>> > would enter the SIMD path and exit right at the beginning (two extra
>>> > branches) because it does this:
>>> > + if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
>>> > + VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))
>>> >
>>> > I thought maybe we could do * 2 or * 4 its binary size, depending on the
>>> > type really, but this is just a proposal if this case is a concern.
>>>
>>> Can we measure the impact of this? How likely is this case?
>>>
>> I'll respond to this separately in a different email.
>>
> My example was incorrect (a tsvector's text representation is lexemes and
> positions, not the original text as written; it's lossy), but the point
> still holds.
> If we have some json(b) column like {"key1":"val1","key2":"val2"}, then in
> CSV format this would immediately exit the SIMD path because of the quote
> character; for json(b) this will always be the case.
> I measured the overhead of exiting the SIMD path repeatedly (8 million
> times for one COPY TO command) and found only a 3% regression for this
> case, sometimes 2%.
>
> For cases where we falsely commit to SIMD because we read a binary
> size >= sizeof(Vector8), which I found very niche too, the cost of
> short-circuiting to scalar each time is even more negligible (the CSV JSON
> case above is the absolute worst case).
> So I don't think any of this should be a concern.
>
>
> Regards,
> Ayoub
>
Rebased patch.
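
For readers following along, here is a minimal, portable sketch of the
behaviour discussed above. It is not the code from the patch: CHUNK_SIZE
stands in for sizeof(Vector8), the inner byte loop stands in for the vector
comparison, and is_special_csv() and skip_clean_prefix() are hypothetical
names invented for this illustration. It only shows the two points at issue:
the length gate before the chunked path, and the immediate exit when the
first chunk contains a special character such as the CSV quote.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for sizeof(Vector8); the real width is platform-dependent. */
#define CHUNK_SIZE 16

/* Hypothetical helper: does this byte need special handling in CSV output? */
static bool
is_special_csv(char c, char delim, char quote)
{
    return c == delim || c == quote || c == '\n' || c == '\r';
}

/*
 * Hypothetical helper: return how many leading bytes of s can be skipped
 * because they lie in whole chunks containing no special character.
 * The caller would continue with the ordinary scalar loop from that offset.
 */
static size_t
skip_clean_prefix(const char *s, size_t len, char delim, char quote)
{
    size_t      i = 0;

    assert(s != NULL);

    /* Gate: short inputs never enter the chunked path. */
    if (len <= CHUNK_SIZE)
        return 0;

    while (i + CHUNK_SIZE <= len)
    {
        bool        found = false;

        /* Scalar stand-in for a vector "any byte matches" test. */
        for (size_t j = 0; j < CHUNK_SIZE; j++)
            found |= is_special_csv(s[i + j], delim, quote);

        /* First chunk with a special byte: fall back to the scalar loop. */
        if (found)
            break;

        i += CHUNK_SIZE;
    }

    return i;
}
```

With delimiter ',' and quote '"', the JSON value quoted above passes the
length gate but leaves the chunked path on the very first chunk, which is
the worst case that was measured.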

Regards,
Ayoub

Attachment Content-Type Size
v3-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch text/x-patch 12.6 KB
