| From: | Haibo Yan <tristan(dot)yim(at)gmail(dot)com> |
|---|---|
| To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: vectorized CRC on ARM64 |
| Date: | 2026-03-18 03:34:40 |
| Message-ID: | C3ADF28D-E6D4-41D2-ADE2-C7DD53EA8A5C@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi John
Thank yo for working on this. I had one question about the mixed use of intrinsics and inline asm here.
> On Jan 12, 2026, at 1:27 AM, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Wed, May 14, 2025 I wrote:
>>
>> We did something similar for x86 for v18, and here is some progress
>> towards Arm support.
>
> Coming back to this, since there's been recent interest in Arm support.
>
> v2 is a rebase, with a few changes.
>
> - I simplified it by leaving out the inlining for "assume CRC" builds,
> since I wanted to avoid alignment considerations if I can. I think
> always indirecting through a pointer will have less risk of
> regressions in a realistic setting than for x86 since Arm chips
> typically have low latency for carryless multiplication instructions.
> With just a bit of code we can still use the direct call for small
> constant inputs, so I did that to avoid regressions under WAL insert
> lock.
>
> - One coding idiom for a vector literal in the generated code was
> giving pgindent indigestion, I so rewrote it using Neon intrinsics and
> verified it in Godbolt.
>
>> 0002: Like 3c6e8c12389 and in fact uses the same program to generate
>> the code, by specifying Neon instructions with the Arm "crypto"
>> extension instead. There are some interesting differences from x86
>> here as well:
>> - The upstream implementation chose to use inline assembly instead of
>> intrinsics for some reason. I initially thought that was a way to get
>> broader compiler support, but it turns out you still need to pass the
>> relevant flags to get the assembly to link.
Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.
Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?
> To follow-up for curiosity's sake, [1] says that Apple chips can issue
> PMULL + EOR as a single uop if they are next to each other in the
> instruction stream.
>
>> - I only have Meson support for now, since I used MacOS on CI to test.
>> That OS and compiler combination apparently targets the CRC extension,
>> but the PMULL instruction runtime check uses Linux-only headers, I
>> believe, so previously I hacked the choose function to return true for
>> testing. The choose function in 0002 is untested in this form.
>
> This is still true, but now the CI hack lives in a separate
> not-for-commit patch for clarity.
>
> autoconf support is a WIP, and I will share that after I do some
> testing on an Arm Linux instance.
>
> [1] https://dougallj.github.io/applecpu/firestorm.html
>
> --
> John Naylor
> Amazon Web Services
> <v2-0001-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch><v2-0002-Force-testing-on-MacOS-CI-XXX-not-for-commit.patch>
Regards
Haibo
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Bertrand Drouvot | 2026-03-18 03:57:48 | Re: relfilenode statistics |
| Previous Message | Corey Huinker | 2026-03-18 03:11:18 | Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions |