From: | KAZAR Ayoub <ma_kazar(at)esi(dot)dz> |
---|---|
To: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, "nathandbossart(at)gmail(dot)com" <nathandbossart(at)gmail(dot)com>, "ants(dot)aasma(at)cybertec(dot)at" <ants(dot)aasma(at)cybertec(dot)at> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Speed up COPY FROM text/CSV parsing using SIMD |
Date: | 2025-10-21 06:17:01 |
Message-ID: | CA+K2RumH-b=3-v0rfQ-oAbuQFxY8JLSSpVhmaJn+gRnX3t1_vg@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Oct 18, 2025 at 10:01 PM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
wrote:
> Thank you so much for doing this! The results look nice, do you think
> there are any other benchmarks that might be interesting to try?
>
> > I'm also trying the idea of doing SIMD inside quotes with prefix XOR
> > using carry less multiplication avoiding the slow path in all cases even
> > with weird looking input, but it needs to take into consideration the
> > availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we
> > go, it quickly starts to become dirty OR we can wait for the decision to
> > start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
>
> I can not quite picture this, would you mind sharing a few examples or
> patches?
>
The idea is to avoid stopping at characters that are not actually special
in their position (inside quotes, escaped, etc.).
This is done by building several masks from the original chunk, such as
quote_mask, escape_mask, and a mask of odd-length escape sequences; from
these we can deduce which quotes are not special and need no stop.
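To make the mask step a bit more concrete, here is a minimal sketch of how
such masks could be built from one 16-byte chunk. It is not the code from
the patch: the names (ChunkMasks, byte_eq_mask, build_chunk_masks) are made
up for illustration, and the "real quote" rule is simplified to "not directly
preceded by an escape byte in the same chunk". It ignores runs of escapes
(the odd-sequence mask) and an escape sitting at the end of the previous
chunk.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CHUNK_SIZE 16

/* Hypothetical container for the per-chunk bitmasks (bit i <-> byte i). */
typedef struct ChunkMasks
{
	uint32_t	quote;		/* bytes equal to the quote character */
	uint32_t	escape;		/* bytes equal to the escape character */
	uint32_t	newline;	/* bytes equal to '\n' */
	uint32_t	real_quote;	/* quotes not directly preceded by an escape */
} ChunkMasks;

/* Bit i of the result is set when chunk[i] == c. */
static uint32_t
byte_eq_mask(const char *chunk, char c)
{
#if defined(__SSE2__)
	__m128i		data = _mm_loadu_si128((const __m128i *) chunk);
	__m128i		eq = _mm_cmpeq_epi8(data, _mm_set1_epi8(c));

	return (uint32_t) _mm_movemask_epi8(eq);
#else
	/* Scalar fallback so the sketch also builds without SSE2. */
	uint32_t	mask = 0;

	for (int i = 0; i < CHUNK_SIZE; i++)
		if (chunk[i] == c)
			mask |= (uint32_t) 1 << i;
	return mask;
#endif
}

/* chunk must point to at least CHUNK_SIZE readable bytes. */
static ChunkMasks
build_chunk_masks(const char *chunk, char quote, char escape)
{
	ChunkMasks	m;

	m.quote = byte_eq_mask(chunk, quote);
	m.escape = byte_eq_mask(chunk, escape);
	m.newline = byte_eq_mask(chunk, '\n');

	/*
	 * Simplification (assumption of this sketch): a quote is "escaped" when
	 * the byte right before it is an escape. Runs of escapes and escapes
	 * carried over from the previous chunk are not handled here.
	 */
	m.real_quote = m.quote & ~(m.escape << 1);
	return m;
}
```

For the 16 bytes `1,"a\"b",2` + LF + `3,4,5` with quote `"` and escape `\`,
the quote at offset 5 is dropped from real_quote because the byte at offset
4 is the escape.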
Then, for quoted regions, we want to know which characters in our chunk are
inside quotes (also keeping track of the previous chunk's quote state), and
there's a clever/fast way to do it [1].
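As a rough illustration of the inside-quotes step: [1] computes a prefix XOR
of the quote mask with PCLMULQDQ. The sketch below computes the same prefix
XOR with plain shifts and XORs so that it stays portable, and carries the
quote state from one chunk to the next. The function names and the 16-bit
mask width are assumptions for this example, not something taken from the
patch.

```c
#include <stdint.h>

/*
 * Prefix XOR over 16 bits: bit i of the result is the XOR of bits 0..i of x.
 * With x being a mask of unescaped quotes, bit i ends up set when an odd
 * number of quotes has been seen up to and including position i.
 */
static uint16_t
prefix_xor16(uint16_t x)
{
	x ^= (uint16_t) (x << 1);
	x ^= (uint16_t) (x << 2);
	x ^= (uint16_t) (x << 4);
	x ^= (uint16_t) (x << 8);
	return x;
}

/*
 * Returns a mask of bytes that are inside quotes for this chunk.
 *
 * *in_quote is 1 when the previous chunk ended inside a quoted field and is
 * updated for the next chunk. An opening quote is counted as "inside", a
 * closing quote as "outside".
 */
static uint16_t
in_quote_mask(uint16_t quote_mask, int *in_quote)
{
	uint16_t	inside = prefix_xor16(quote_mask);

	/* Starting inside a quote flips the parity of every position. */
	if (*in_quote)
		inside = (uint16_t) ~inside;

	/* The state of the last byte is the state the next chunk starts in. */
	*in_quote = (inside >> 15) & 1;
	return inside;
}
```

With quotes at offsets 2 and 7, prefix_xor16(0x0084) gives 0x007C, i.e.
offsets 2 through 6 are inside the quoted field.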
After this you start matching LF, CR, etc., all while maintaining the
state of what you've seen (the annoying part).
At the end you only reach the scalar path after advancing to the position
of the first real special character that requires special treatment.
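A minimal sketch of that last step, assuming the LF, CR and inside-quotes
masks for the chunk are already available as 16-bit values (the function
name and the choice of LF/CR as the only "special" bytes are assumptions for
illustration):

```c
#include <stdint.h>

#define CHUNK_SIZE 16

/*
 * Returns the offset of the first LF or CR that is not inside quotes, or
 * CHUNK_SIZE when the whole chunk can be skipped without entering the
 * scalar path.
 */
static int
first_special_offset(uint16_t lf_mask, uint16_t cr_mask, uint16_t inside_mask)
{
	/* Line terminators inside a quoted field are ordinary data. */
	uint16_t	special = (uint16_t) ((lf_mask | cr_mask) & ~inside_mask);

	/*
	 * Portable bit scan; real code would more likely use a
	 * count-trailing-zeros primitive instead of this loop.
	 */
	for (int i = 0; i < CHUNK_SIZE; i++)
		if (special & (1u << i))
			return i;
	return CHUNK_SIZE;
}
```

For example, with LF at offsets 4 and 10 and offsets 2 through 6 inside
quotes, the LF at offset 4 is ignored and the function returns 10.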
However, after trying to implement this on top of the existing COPY
pipeline [2] (a broken, hopeless attempt, but it shows the idea), it becomes
very unreasonable for several reasons:
- It is very challenging to correctly handle commas inside quoted fields
and to track quoted vs. unquoted state (especially across chunk boundaries
or with escaped quotes).
- Using carry-less multiplication (CLMUL) for prefix XOR on a 16-byte
chunk is overkill on some architectures where PCLMULQDQ latency is high
[3][4], to the point where it performs worse than unrolled shifts + XOR (5
cycles).
- It starts to feel that handling these cases is inherently scalar; doing
all that work for a 16-byte chunk would be unreasonable since it's not
free, compared to the simple SIMD assist plus heuristic from Nazir, which
is way nicer in general.
Currently we are at 200-400Mbps, which isn't that terrible compared to
production and non-production-grade parsers (of course, parsing isn't the
only thing we do in our case). Also, we are using SSE2 only, so
theoretically, if we add support for AVX later on, we'll have even better
numbers.
Maybe more micro-optimizations to the current heuristic can squeeze out
more.
[1] https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[2] https://github.com/AyoubKaz07/postgres/commit/73c6ecfedae4cce5c3f375fd6074b1ca9dfe1daf
[3] https://agner.org/optimize/instruction_tables.pdf
[4] https://www.uops.info/table.html
Regards,
Ayoub Kazar.