vectorized CRC on ARM64

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: vectorized CRC on ARM64
Date: 2025-05-14 10:36:27
Message-ID: CANWCAZaKhE+RD5KKouUFoxx1EbUNrNhcduM1VQ=DkSDadNEFng@mail.gmail.com
Lists: pgsql-hackers

We did something similar for x86 for v18, and here is some progress
towards Arm support.

0001: Like e2809e3a101 -- inline small constant inputs to compensate
for the fact that 0002 will do a runtime check even if the usual CRC
extension is targeted. There is a difference from x86, however: On Arm
we currently align on 8-byte boundaries before looping on 8-byte
chunks. That requirement would prevent loop unrolling. We could use
4-byte chunks to get around that, but it's not clear which way is
best. I've coded it so it's easy to try both ways.
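
To make the trade-off in 0001 concrete, here is a minimal, portable sketch, not the patch itself. The helper crc_bytes() is a hypothetical bitwise stand-in for the Arm CRC32C instructions so the example runs anywhere, the names crc32c_small, crc32c_aligned8 and CRC32C_UPDATE are invented for illustration, the cutoff of 32 bytes is an assumed value, and __builtin_constant_p is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware CRC32C instructions:
 * feeds n bytes into crc one bit at a time (reflected polynomial). */
static inline uint32_t
crc_bytes(uint32_t crc, const unsigned char *p, size_t n)
{
    while (n--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Out-of-line path: align on an 8-byte boundary, then 8-byte chunks.
 * The data-dependent preamble is what gets in the way of unrolling. */
static uint32_t
crc32c_aligned8(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (; len > 0 && ((uintptr_t) p & 7) != 0; p++, len--)
        crc = crc_bytes(crc, p, 1);
    for (; len >= 8; p += 8, len -= 8)
        crc = crc_bytes(crc, p, 8);     /* one 8-byte step */
    for (; len > 0; p++, len--)
        crc = crc_bytes(crc, p, 1);
    return crc;
}

/* Inline path: 4-byte chunks with no alignment preamble, so a
 * compile-time-constant len gives a fixed trip count. */
static inline uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (; len >= 4; p += 4, len -= 4)
        crc = crc_bytes(crc, p, 4);     /* one 4-byte step */
    for (; len > 0; p++, len--)
        crc = crc_bytes(crc, p, 1);
    return crc;
}

/* Assumed cutoff: small constant lengths are inlined, everything
 * else goes through the out-of-line function. */
#define CRC32C_UPDATE(crc, data, len) \
    ((__builtin_constant_p(len) && (len) < 32) \
     ? crc32c_small((crc), (data), (len)) \
     : crc32c_aligned8((crc), (data), (len)))
```

Both paths compute the same checksum; only the chunking and the alignment preamble differ, which is the part that is easy to switch between the two approaches.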

0002: Like 3c6e8c12389 and in fact uses the same program to generate
the code, by specifying Neon instructions with the Arm "crypto"
extension instead. There are some interesting differences from x86
here as well:
- The upstream implementation chose to use inline assembly instead of
intrinsics for some reason. I initially thought that was a way to get
broader compiler support, but it turns out you still need to pass the
relevant flags to get the assembly to link.
- I only have Meson support for now, since I used macOS on CI to test.
That OS and compiler combination apparently targets the CRC extension,
but the PMULL instruction runtime check uses Linux-only headers, I
believe, so previously I hacked the choose function to return true for
testing. The choose function in 0002 is untested in this form.
- On x86 it could be fairly costly to align on a cacheline boundary
before beginning the main loop so I elected to skip that for short-ish
inputs in PG18. On Arm the main loop uses 4 16-byte accumulators, so
the patch acts like upstream and always aligns on 16-byte boundaries.
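
For reference, a runtime PMULL check on Linux can be sketched as follows. This is an assumption about what such a check might look like, not the choose function from 0002: have_pmull() is a hypothetical name, and the code relies on getauxval() from <sys/auxv.h> and HWCAP_PMULL from <asm/hwcap.h>, which are available on Linux/aarch64 only. On any other platform the sketch simply reports false.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>       /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>      /* HWCAP_PMULL */
#endif

/* Hypothetical helper: returns true if the kernel reports the PMULL
 * (polynomial multiply long) instruction as available. */
static bool
have_pmull(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#else
    /* No portable way to ask here (e.g. macOS), so be conservative. */
    return false;
#endif
}
```

A choose function would call something like this once and then install either the vectorized or the plain CRC-extension implementation in a function pointer.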

0003: An afterthought regarding the above-mentioned alignment, this is
an alternative preamble that might shave a couple of cycles for 4-byte
aligned inputs, e.g. WAL.
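
One way to picture such a preamble is sketched below. This is a guess at the idea, not the contents of 0003: crc32c_sw() is a hypothetical bitwise stand-in for the hardware CRC32C instructions, and crc32c_preamble16() is an invented name. If the pointer is already 4-byte aligned, at most three 4-byte steps reach a 16-byte boundary, instead of up to fifteen single-byte steps.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware CRC32C instructions. */
static uint32_t
crc32c_sw(uint32_t crc, const unsigned char *p, size_t n)
{
    while (n--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Consume input until *pp is 16-byte aligned, or until too little
 * input remains to take another step. Updates *pp and *lenp. */
static uint32_t
crc32c_preamble16(uint32_t crc, const unsigned char **pp, size_t *lenp)
{
    const unsigned char *p = *pp;
    size_t      len = *lenp;

    if (((uintptr_t) p & 3) == 0)
    {
        /* 4-byte aligned input: at most three 4-byte steps. */
        while (len >= 4 && ((uintptr_t) p & 15) != 0)
        {
            crc = crc32c_sw(crc, p, 4);
            p += 4;
            len -= 4;
        }
    }
    else
    {
        /* General case: single-byte steps. */
        while (len > 0 && ((uintptr_t) p & 15) != 0)
        {
            crc = crc32c_sw(crc, p, 1);
            p++;
            len--;
        }
    }

    *pp = p;
    *lenp = len;
    return crc;
}
```

Whether the extra branch pays for itself would need measuring on real hardware.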

--
John Naylor
Amazon Web Services

Attachment Content-Type Size
v1-0001-Inline-CRC-computation-for-small-fixed-length-inp.patch application/x-patch 2.3 KB
v1-0002-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch application/x-patch 9.9 KB
v1-0003-WIP-Attempt-alignment-preamble-better-suited-to-W.patch application/x-patch 1.3 KB
