| From: | "Greg Burd" <greg(at)burd(dot)me> |
|---|---|
| To: | "Nathan Bossart" <nathandbossart(at)gmail(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de> |
| Cc: | "John Naylor" <johncnaylorls(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net> |
| Subject: | Re: Add RISC-V Zbb popcount optimization |
| Date: | 2026-05-27 17:04:46 |
| Message-ID: | 3a222ec2-01bb-4798-99e2-eedaf6cae19b@app.fastmail.com |
| Lists: | pgsql-hackers |
On Fri, Mar 27, 2026, at 4:22 PM, Greg Burd wrote:
> On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:
>> On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
>>> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
>>> all that effectively - hard to believe there's any real world workloads where
>>> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
>>> world use of those platforms, making niche-y perf improvements somewhat
>>> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
>>> adoption.
>
> Hey Nathan,
>
>> That work was partially motivated by vector stuff that used popcount
>> functions pretty heavily, but yeah, the complexity compared to the gains is
>> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
>> and Neon). I'd still consider using AVX-512, etc. for things if the impact
>> on real-world workloads was huge, though.
>
> Yes, that and by research done while trying to understand why my RISC-V
> build farm animal "greenfly" (OrangePi RV2 with a VisionFive 2 CPU:
> RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.
>
>> --
>> nathan
>
> Forgive me, while $subject only mentions popcount I couldn't help
> myself so I added a few more RISC-V patches including a bug fix that I
> hope makes greenfly happy again.
>
>
> 0001 - This is a bug fix for DES initialization when built with Clang on RISC-V.
>
> ------> Join me in "the rabbit hole" on this issue if you care to...
>
> The existing software DES (as shown by the build-farm animal "greenfly"
> [1]) fails because Clang 20 has an auto-vectorization bug that we
> trigger in the DES initialization code (des_init() function), not the
> DES encryption algorithm itself.
>
> I searched the LLVM issue tracker, here are the issues that caught my eye:
> 1. Issue #176001 - "RISC-V Wrong code at -O1"
> - Vector peephole optimization with vmerge folding
> - Fixed by PR #176077 (merged Jan 2026)
> - Link: https://github.com/llvm/llvm-project/issues/176001
> 2. Issue #187458 - "Wrong code for vector.extract.last.active"
> - Large index issues with zvl1024b
> - Partially fixed, still work ongoing
> - Link: https://github.com/llvm/llvm-project/issues/187458
> 3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
> - Illegal instruction from mismatched EEW
> - Under investigation
> - Link: https://github.com/llvm/llvm-project/issues/171978
> 4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
> - Cost model fixes for scatter/gather (merged Jan 2026)
> - Link: https://github.com/llvm/llvm-project/pull/176105
>
> My fix in 0001 is simply adding this in a few places in crypt-des.c:
>
> #if defined(__riscv) && defined(__clang__)
> pg_memory_barrier();
> #endif
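To make the shape of that guard concrete, here is a minimal sketch of inverting the DES P-box with the barrier after the loop. It is not the patch itself: `sketch_memory_barrier()` is a stand-in I am assuming for `pg_memory_barrier()`, `init_un_pbox()` is a hypothetical name, and the barrier placement is illustrative only (the patch adds it "in a few places" in crypt-des.c).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: stand-in for pg_memory_barrier(), built on the GCC/Clang
 * __atomic_thread_fence builtin so the sketch compiles outside PostgreSQL.
 */
#define sketch_memory_barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)

/* DES P-box permutation (1-based bit positions) from the DES standard. */
static const uint8_t pbox[32] = {
    16,  7, 20, 21, 29, 12, 28, 17,  1, 15, 23, 26,  5, 18, 31, 10,
     2,  8, 24, 14, 32, 27,  3,  9, 19, 13, 30,  6, 22, 11,  4, 25
};

static uint8_t un_pbox[32];

/* Build the inverse permutation: un_pbox[pbox[i] - 1] = i. */
static void
init_un_pbox(void)
{
    for (int i = 0; i < 32; i++)
        un_pbox[pbox[i] - 1] = (uint8_t) i;

#if defined(__riscv) && defined(__clang__)
    /* Only on the affected compiler/target: fence after the table is built. */
    sketch_memory_barrier();
#endif
}
```

After calling `init_un_pbox()`, the first five entries should be 8, 16, 22, 30, 12, which are the "expected" values printed by the failing des-clang-o2-hw run in the test output.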
>
> While searching I ran across a different solution: adding `-mllvm
> -riscv-v-vector-bits-min=0` sets the minimum vector bit width for the
> RISC-V vector extension in LLVM to 0, which disables all vectorization
> and forces scalar code generation, so no RVV instructions are emitted.
> This would prevent the DES bug at the cost of any vectorization
> anywhere in the binary.
>
> While that might also fix the other intermittent bug we'd been seeing
> on greenfly (not tested), disabling all RVV optimizations seems too
> heavy-handed to me.
>
>
> ------> Moving on.
>
> 0002 - (was "0001" in v2) this is unchanged, it implements popcount
> using Zbb extension on RISC-V
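As a rough illustration of the two paths being compared, the sketch below pairs a portable bit-twiddling popcount with the compiler builtin. The function names are hypothetical and this is not the code from the patch; the assumption is that, when built with `-march=rv64gc_zbb`, GCC and Clang can lower `__builtin_popcountll` to the Zbb `cpop` instruction, while without it they fall back to a software sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable SWAR popcount: sums bits in 2-, 4-, then 8-bit groups. */
static inline int
popcount64_sw(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds the eight byte counts into the top byte. */
    return (int) ((x * 0x0101010101010101ULL) >> 56);
}

/* Compiler builtin; may become a single cpop when Zbb is enabled. */
static inline int
popcount64_builtin(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Count set bits across a buffer of 64-bit words using the builtin path. */
static uint64_t
popcount_words(const uint64_t *words, size_t nwords)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t) popcount64_builtin(words[i]);
    return total;
}
```

Both functions must agree on every input; only the instruction sequence the compiler emits differs.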
>
> 0003 - is a small patch adapted from the Google Abseil project's
> RISC-V CRC32C implementation [1]. It is *a lot faster* than the
> software crc32c we fall back to now (see: riscv-crc32c.c). This
> algorithm requires the Zbc (or Zbkc) extension (for clmul), so the
> patch tests for that at build time and adds the '-march' flag when it
> is available. However, as is the case for Zbb and popcount in 0002,
> the presence of Zbc (or Zbkc) must be detected at runtime. That's done
> following the pre-existing pattern used for ARM features. This does
> introduce some runtime overhead and complexity, not more than required
> I hope.
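The runtime-selection idea can be sketched as a function pointer that resolves itself on first use. Everything named here is hypothetical: `sketch_cpu_has_zbc()` is a placeholder probe that always returns false, and only a bitwise software CRC-32C is wired in, so this shows the dispatch shape rather than the clmul algorithm from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c_sw(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    crc = ~crc;
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
    }
    return ~crc;
}

/* Hypothetical probe; a real one would ask the OS whether Zbc/Zbkc exists. */
static bool
sketch_cpu_has_zbc(void)
{
    return false;
}

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Starts at the chooser; the first call replaces it with the real routine. */
static crc32c_fn crc32c_impl = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
    if (sketch_cpu_has_zbc())
        crc32c_impl = crc32c_sw;    /* placeholder: a clmul routine goes here */
    else
        crc32c_impl = crc32c_sw;
    return crc32c_impl(crc, data, len);
}
```

Calling `crc32c_impl(0, "123456789", 9)` returns 0xE3069283, the same check value the test programs print in their validation line. After the first call the pointer targets the chosen routine directly, so the probe cost is paid once.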
>
> I attached test code, and results at the end of this email:
> * riscv-popcnt.c - unchanged
> * riscv-crc32c.c - new, based on work in the Google Abseil project
> * riscv-des.c - highlights the fix for DES using Clang on RISC-V
>
> I guess the question for 0002 and/or 0003 is if the "juice" is worth the
> "squeeze" or not. There is a lot of performance juice to be had IMO.
> But some might argue that RISC-V isn't widely adopted yet, and they'd
> be right. Others might point out that RISC-V is currently showing up
> in embedded systems more than server/desktop/laptop/cloud, also true.
> However, there is some evidence that is changing as there are RISC-V
> servers [2][3], and there is a hosted (cloud) solution from Scaleway
> [4]. There exists a 64 core RISC-V desktop [5] and a Framework laptop
> mainboard [6] sporting a RISC-V CPU. And there is the OrangePi RV2
> [7] I have that is "greenfly".
>
> Is it early days? Certainly! But too early? That's up for debate. :)
>
> If nothing else, these patches can be a durable record and used later
> when RISC-V is a critical platform for Postgres or informational to
> other projects.
Rebased and tested (v4) adding better support for RISC-V with a fix for DES and faster popcount and CRC32 when the CPU supports it.
best.
-greg
> best.
>
> -greg
>
> [1] https://github.com/abseil/abseil-cpp/pull/1986
> absl/crc/internal/crc_riscv.cc
> [2]
> https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
> [3]
> https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
> [4]
> https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
> [5] https://milkv.io/pioneer and
> https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
> [6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
> [7]
> http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
>
>
> ---- TEST PROGRAM OUTPUT:
>
> gburd(at)rv:~/ws/postgres$ make -f Makefile.RISCV
> gcc -O2 riscv-des.c -o des-gcc-sw
> gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
> clang-20 -O1 riscv-des.c -o des-clang-o1-sw
> clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
> clang-20 -O2 riscv-des.c -o des-clang-o2-sw
> clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
> gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
> gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
> clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
> clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
> gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
> gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
> clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
> clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
> gburd(at)rv:~/ws/postgres$ make -f Makefile.RISCV test
> ./des-gcc-sw
> Compiler: GCC 13.3.0
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.409 seconds (409 ns/iter)
> With barriers: 0.416 seconds (416 ns/iter)
> Overhead: 1.6%
> ./des-gcc-hw
> Compiler: GCC 13.3.0
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.410 seconds (410 ns/iter)
> With barriers: 0.410 seconds (410 ns/iter)
> Overhead: Negligible
> ./des-clang-o1-sw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.517 seconds (517 ns/iter)
> With barriers: 0.516 seconds (516 ns/iter)
> Overhead: Negligible
> ./des-clang-o1-hw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.405 seconds (405 ns/iter)
> With barriers: 0.405 seconds (405 ns/iter)
> Overhead: Negligible
> ./des-clang-o2-sw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Not enabled
>
> Testing WITHOUT compiler barriers:
> PASS: Permutation tables are correct
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.517 seconds (517 ns/iter)
> With barriers: 0.518 seconds (518 ns/iter)
> Overhead: Negligible
> ./des-clang-o2-hw
> Compiler: Clang 20.1.2
> Target: RISC-V 64-bit
> Vector extension: Enabled (RVV)
>
> Testing WITHOUT compiler barriers:
> ERROR: un_pbox mismatch:
> un_pbox[0] = 15, expected 8
> un_pbox[1] = 6, expected 16
> un_pbox[2] = 19, expected 22
> un_pbox[3] = 20, expected 30
> un_pbox[4] = 28, expected 12
> ... and 27 more errors
> FAIL: Permutation tables are incorrect
>
> Testing WITH compiler barriers:
> PASS: Permutation tables are correct
>
> Performance Comparison (1000000 iterations):
> Without barriers: 0.093 seconds (93 ns/iter)
> With barriers: 0.407 seconds (407 ns/iter)
> Overhead: 335.5%
> ./popcnt-gcc-o2-sw
> sw popcount: 0.183 sec ( 547.89 MB/s)
> hw popcount: 0.274 sec ( 365.40 MB/s)
>
> diff: 0.67x
> match: 406261900 bits counted
> ./popcnt-gcc-o2-hw
> sw popcount: 0.182 sec ( 548.17 MB/s)
> hw popcount: 0.044 sec ( 2287.82 MB/s)
>
> diff: 4.17x
> match: 406261900 bits counted
> ./popcnt-clang-o2-sw
> sw popcount: 0.188 sec ( 531.96 MB/s)
> hw popcount: 0.207 sec ( 482.84 MB/s)
>
> diff: 0.91x
> match: 406261900 bits counted
> ./popcnt-clang-o2-hw
> sw popcount: 0.224 sec ( 446.46 MB/s)
> hw popcount: 0.056 sec ( 1794.83 MB/s)
>
> diff: 4.02x
> match: 406261900 bits counted
> ./crc32c-gcc-o2-sw
> sw crc32c: 0.651 sec ( 153.68 MB/s)
> hw crc32c: 0.651 sec ( 153.72 MB/s)
>
> diff: 1.00x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-gcc-o2-hw
> sw crc32c: 0.651 sec ( 153.70 MB/s)
> hw crc32c: 0.000 sec ( 308052.33 MB/s)
>
> diff: 2004.21x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-clang-o2-sw
> sw crc32c: 0.584 sec ( 171.10 MB/s)
> hw crc32c: 0.584 sec ( 171.17 MB/s)
>
> diff: 1.00x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> ./crc32c-clang-o2-hw
> sw crc32c: 0.584 sec ( 171.15 MB/s)
> hw crc32c: 0.000 sec ( 309282.38 MB/s)
>
> diff: 1807.08x
> match: 0x0B141F2D
>
> validation: CRC32C("123456789") = 0xE3069283 (correct)
> Attachments:
> * Makefile.RISCV
> * riscv-crc32c.c
> * riscv-des.c
> * riscv-popcnt.c
> * v3-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch
> * v3-0002-Add-RISC-V-popcount-using-Zbb-extension.patch
> * v3-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch
| Attachment | Content-Type | Size |
|---|---|---|
| v4-0001-Avoid-Clang-RISC-V-auto-vectorization-bug-in-DES.patch | text/x-patch | 2.7 KB |
| v4-0002-Add-RISC-V-popcount-using-Zbb-extension.patch | text/x-patch | 11.4 KB |
| v4-0003-Add-RISC-V-CRC32C-using-the-Zbc-extension.patch | text/x-patch | 19.3 KB |