| From: | "Greg Burd" <greg(at)burd(dot)me> |
|---|---|
| To: | "Nathan Bossart" <nathandbossart(at)gmail(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de> |
| Cc: | "John Naylor" <johncnaylorls(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net> |
| Subject: | Re: Add RISC-V Zbb popcount optimization |
| Date: | 2026-03-27 20:22:00 |
| Message-ID: | 038b2469-776f-404b-ad7e-e85f45da2166@app.fastmail.com |
| Lists: | pgsql-hackers |
On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:
> On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
>> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
>> all that effectively - hard to believe there's any real world workloads where
>> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
>> world use of those platforms, making niche-y perf improvements somewhat
>> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
>> adoption.
Hey Nathan,
> That work was partially motivated by vector stuff that used popcount
> functions pretty heavily, but yeah, the complexity compared to the gains is
> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
> and Neon). I'd still consider using AVX-512, etc. for things if the impact
> on real-world workloads was huge, though.
Yes, that and by research done while trying to understand why my RISC-V build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU: RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.
> --
> nathan
Forgive me: while $subject only mentions popcount, I couldn't help myself, so I added a few more RISC-V patches, including a bug fix that I hope makes greenfly happy again.
0001 - This is a bug fix for DES initialization on RISC-V when built with Clang.
------> Join me in "the rabbit hole" on this issue if you care to...
The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption algorithm itself.
I searched the LLVM issue tracker; here are the issues that caught my eye:
1. Issue #176001 - "RISC-V Wrong code at -O1"
- Vector peephole optimization with vmerge folding
- Fixed by PR #176077 (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/issues/176001
2. Issue #187458 - "Wrong code for vector.extract.last.active"
- Large index issues with zvl1024b
- Partially fixed, work still ongoing
- Link: https://github.com/llvm/llvm-project/issues/187458
3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
- Illegal instruction from mismatched EEW
- Under investigation
- Link: https://github.com/llvm/llvm-project/issues/171978
4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
- Cost model fixes for scatter/gather (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/pull/176105
My fix in 0001 is simply adding this in a few places in crypt-des.c:
#if defined(__riscv) && defined(__clang__)
pg_memory_barrier();
#endif
While searching I ran across a different solution: adding `-mllvm -riscv-v-vector-bits-min=0` sets the minimum vector bit width for the RISC-V vector extension in LLVM to 0, which disables all vectorization and forces scalar code generation, so no RVV instructions are emitted. This would prevent the DES bug at the cost of any vectorization anywhere in the binary.
While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested), disabling all RVV optimizations seems too heavy-handed to me.
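To make the shape of the 0001 workaround concrete, here is a minimal standalone sketch. It is not the patch itself: `des_init_barrier()` is a hypothetical stand-in for pg_memory_barrier() (a compiler-only barrier here, which is weaker than PostgreSQL's real macro), and placing it right after the P-box inversion loop is an assumption made for illustration. The table is the standard DES P-box.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pg_memory_barrier(). This is only a compiler
 * barrier; it is an assumption so the sketch builds outside the tree.
 */
#if defined(__riscv) && defined(__clang__)
#define des_init_barrier() __asm__ __volatile__("" ::: "memory")
#else
#define des_init_barrier() ((void) 0)
#endif

/* The standard DES P-box permutation (1-based bit positions). */
const uint8_t pbox[32] = {
	16, 7, 20, 21, 29, 12, 28, 17, 1, 15, 23, 26, 5, 18, 31, 10,
	2, 8, 24, 14, 32, 27, 3, 9, 19, 13, 30, 6, 22, 11, 4, 25
};

uint8_t un_pbox[32];

/* Invert the P-box, then fence; the barrier placement is illustrative. */
void
invert_pbox(void)
{
	for (int i = 0; i < 32; i++)
	{
		assert(pbox[i] >= 1 && pbox[i] <= 32);
		un_pbox[pbox[i] - 1] = (uint8_t) i;
	}

	/* Only expands to a real barrier for Clang on RISC-V. */
	des_init_barrier();
}
```

After calling invert_pbox(), un_pbox[0] through un_pbox[4] should hold 8, 16, 22, 30 and 12, which are the "expected" values reported by the failing des-clang-o2-hw run in the test output.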
------> Moving on.
0002 - (was "0001" in v2) this is unchanged; it implements popcount using the Zbb extension on RISC-V
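As a rough illustration of what is being compared in the popcount numbers (not the code in 0002), here is a small sketch with a portable bit-twiddling popcount next to the GCC/Clang builtin. The assumption is that, when built with something like -march=rv64gc_zbb, the compiler can lower `__builtin_popcountll` to the Zbb `cpop` instruction; without that flag it falls back to a software sequence. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable software popcount for one 64-bit word. */
int
popcount64_sw(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/* Compiler builtin; may become a single cpop when Zbb is enabled. */
int
popcount64_builtin(uint64_t x)
{
	return __builtin_popcountll(x);
}

/* Sum the set bits of n words using the supplied per-word routine. */
uint64_t
popcount_words(const uint64_t *words, size_t n, int (*count) (uint64_t))
{
	uint64_t	total = 0;

	assert(words != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += (uint64_t) count(words[i]);
	return total;
}
```

Passing either popcount64_sw or popcount64_builtin to popcount_words() should give the same total for the same buffer; only the speed is expected to differ.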
0003 - is a small patch adapted from the Google Abseil project's RISC-V CRC32C implementation [1]. It is *a lot faster* than the software crc32c we fall back to now (see: riscv-crc32c.c). This algorithm requires the Zbc (or Zbkc) extension (for clmul), so the patch tests for that at build time and adds the '-march' flag when it is available. However, as is the case for Zbb and popcnt, the presence of Zbc (or Zbkc) must be detected at runtime. That's done following the pre-existing pattern used for ARM features. This does introduce some runtime overhead and complexity, not more than required I hope.
I attached test code; the results are at the end of this email:
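The runtime-detection pattern described for 0003 can be sketched roughly as follows. This is not the patch: `cpu_has_zbc()` is a hypothetical probe that always reports "no" here, `crc32c_zbc_placeholder()` merely forwards to the software routine where a clmul-based implementation would go, and the software path is a plain bitwise CRC32C rather than the table-driven code PostgreSQL uses. What it does show is a function pointer that starts at a "choose" routine and replaces itself on first use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* Bitwise software CRC32C (reflected Castagnoli polynomial 0x82F63B78). */
uint32_t
crc32c_sw(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len-- > 0)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (UINT32_C(0x82F63B78) & (0u - (crc & 1u)));
	}
	return crc;
}

/* Placeholder: a real build would use a Zbc/Zbkc clmul routine here. */
uint32_t
crc32c_zbc_placeholder(uint32_t crc, const void *data, size_t len)
{
	return crc32c_sw(crc, data, len);
}

/* Hypothetical probe; real detection would ask the OS or the hardware. */
int
cpu_has_zbc(void)
{
	return 0;
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts at the chooser, which installs the real routine on first call. */
static crc32c_fn crc32c_impl = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	crc32c_impl = cpu_has_zbc() ? crc32c_zbc_placeholder : crc32c_sw;
	return crc32c_impl(crc, data, len);
}

/* One-shot helper: init to all ones, update, then invert the result. */
uint32_t
crc32c(const void *data, size_t len)
{
	return crc32c_impl(UINT32_C(0xFFFFFFFF), data, len) ^ UINT32_C(0xFFFFFFFF);
}
```

With this sketch, crc32c("123456789", 9) should return 0xE3069283, the same check value printed in the validation lines of the test output. After the first call the indirect call is the only remaining dispatch cost.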
* riscv-popcnt.c - unchanged
* riscv-crc32c.c - new, based on work in the Google Abseil project
* riscv-des.c - highlights the fix for DES using Clang on RISC-V
I guess the question for 0002 and/or 0003 is whether the "juice" is worth the "squeeze" or not. There is a lot of performance juice to be had IMO. But some might argue that RISC-V isn't widely adopted yet, and they'd be right. Others might point out that RISC-V is currently showing up in embedded systems more than server/desktop/laptop/cloud, also true. However, there is some evidence that this is changing: there are RISC-V servers [2][3], and there is a hosted (cloud) solution from Scaleway [4]. There exists a 64-core RISC-V desktop [5] and a Framework laptop mainboard [6] sporting a RISC-V CPU. And there is the OrangePi RV2 [7] I have that is "greenfly".
Is it early days? Certainly! But too early? That's up for debate. :)
If nothing else, these patches can serve as a durable record, to be used later when RISC-V is a critical platform for Postgres, or as information for other projects.
best.
-greg
[1] https://github.com/abseil/abseil-cpp/pull/1986 absl/crc/internal/crc_riscv.cc
[2] https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
[3] https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
[4] https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
[5] https://milkv.io/pioneer and https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
[6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
[7] http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
---- TEST PROGRAM OUTPUT:
gburd(at)rv:~/ws/postgres$ make -f Makefile.RISCV
gcc -O2 riscv-des.c -o des-gcc-sw
gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
clang-20 -O1 riscv-des.c -o des-clang-o1-sw
clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
clang-20 -O2 riscv-des.c -o des-clang-o2-sw
clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
gburd(at)rv:~/ws/postgres$ make -f Makefile.RISCV test
./des-gcc-sw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.409 seconds (409 ns/iter)
With barriers: 0.416 seconds (416 ns/iter)
Overhead: 1.6%
./des-gcc-hw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.410 seconds (410 ns/iter)
With barriers: 0.410 seconds (410 ns/iter)
Overhead: Negligible
./des-clang-o1-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.516 seconds (516 ns/iter)
Overhead: Negligible
./des-clang-o1-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.405 seconds (405 ns/iter)
With barriers: 0.405 seconds (405 ns/iter)
Overhead: Negligible
./des-clang-o2-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.518 seconds (518 ns/iter)
Overhead: Negligible
./des-clang-o2-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
ERROR: un_pbox mismatch:
un_pbox[0] = 15, expected 8
un_pbox[1] = 6, expected 16
un_pbox[2] = 19, expected 22
un_pbox[3] = 20, expected 30
un_pbox[4] = 28, expected 12
... and 27 more errors
FAIL: Permutation tables are incorrect
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.093 seconds (93 ns/iter)
With barriers: 0.407 seconds (407 ns/iter)
Overhead: 335.5%
./popcnt-gcc-o2-sw
sw popcount: 0.183 sec ( 547.89 MB/s)
hw popcount: 0.274 sec ( 365.40 MB/s)
diff: 0.67x
match: 406261900 bits counted
./popcnt-gcc-o2-hw
sw popcount: 0.182 sec ( 548.17 MB/s)
hw popcount: 0.044 sec ( 2287.82 MB/s)
diff: 4.17x
match: 406261900 bits counted
./popcnt-clang-o2-sw
sw popcount: 0.188 sec ( 531.96 MB/s)
hw popcount: 0.207 sec ( 482.84 MB/s)
diff: 0.91x
match: 406261900 bits counted
./popcnt-clang-o2-hw
sw popcount: 0.224 sec ( 446.46 MB/s)
hw popcount: 0.056 sec ( 1794.83 MB/s)
diff: 4.02x
match: 406261900 bits counted
./crc32c-gcc-o2-sw
sw crc32c: 0.651 sec ( 153.68 MB/s)
hw crc32c: 0.651 sec ( 153.72 MB/s)
diff: 1.00x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-gcc-o2-hw
sw crc32c: 0.651 sec ( 153.70 MB/s)
hw crc32c: 0.000 sec ( 308052.33 MB/s)
diff: 2004.21x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-sw
sw crc32c: 0.584 sec ( 171.10 MB/s)
hw crc32c: 0.584 sec ( 171.17 MB/s)
diff: 1.00x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-hw
sw crc32c: 0.584 sec ( 171.15 MB/s)
hw crc32c: 0.000 sec ( 309282.38 MB/s)
diff: 1807.08x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
| Attachment | Content-Type | Size |
|---|---|---|
| Makefile.RISCV | application/octet-stream | 2.2 KB |
| riscv-crc32c.c | text/x-csrc | 7.9 KB |
| riscv-des.c | text/x-csrc | 6.1 KB |
| riscv-popcnt.c | text/x-csrc | 2.2 KB |
| v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch | text/x-patch | 2.7 KB |
| v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch | text/x-patch | 10.7 KB |
| v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch | text/x-patch | 19.3 KB |