| From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
|---|---|
| To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
| Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: refactor architecture-specific popcount code |
| Date: | 2026-02-02 22:51:54 |
| Message-ID: | aYEqini0ukxQv2_D@nathan |
| Lists: | pgsql-hackers |
On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:
> It might be a good idea to do a little new testing, and I see a use
> for a special 8-byte path independent of AVX512: v6 seems to regress a
> little for single-words. But, it turns out that when gcc turns
> __builtin_popcountl into a single instruction, it's inline, but if it
> emits portable bitwise ops, it does so in a function called
> __popcountdi2(). That can be avoided by hand-coding in C for normal
> builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

Oh, interesting. I looked into this a little more [0]. Both gcc and clang
generate cnt instructions for aarch64, so we're good there. However, with
default flags, clang on x86-64 generates the bit-twiddling version, and gcc
on x86-64 generates a call to __popcountdi2() (which I imagine does
something similar). It's not until you provide a compiler flag like
-march=x86-64-v2 that gcc/clang start generating popcnt instructions for
x86-64, which makes sense, since popcnt is not part of the baseline x86-64
instruction set. 0005 seems like the correct move to me.
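
For illustration, here is a minimal sketch of the kind of hand-coded fallback
being discussed. It is not the code from the attached 0005; the function
names popcount64_portable() and popcount64() are hypothetical, and the
preprocessor test is only an assumption about how one might choose between
the builtin and the portable path.

```c
#include <stdint.h>

/*
 * Portable 64-bit popcount using the classic parallel bit-summing trick.
 * Hypothetical name; written inline so no call to __popcountdi2() is needed.
 */
static inline int
popcount64_portable(uint64_t x)
{
	/* Sum adjacent bits into 2-bit fields. */
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	/* Sum adjacent 2-bit fields into 4-bit fields. */
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	/* Sum adjacent 4-bit fields into per-byte counts. */
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	/* Add all byte counts together; the total lands in the top byte. */
	return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/*
 * Hypothetical wrapper: use the builtin only where the compiler is expected
 * to emit a single instruction.  gcc and clang define __POPCNT__ when popcnt
 * is enabled (e.g. -mpopcnt or -march=x86-64-v2), and __aarch64__ on
 * AArch64, where both emit cnt.
 */
static inline int
popcount64(uint64_t x)
{
#if defined(__GNUC__) && (defined(__POPCNT__) || defined(__aarch64__))
	return __builtin_popcountll(x);
#else
	return popcount64_portable(x);
#endif
}
```

With default flags on x86-64, neither macro is defined, so the wrapper falls
through to the inline C version instead of the builtin.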
[0] https://godbolt.org/z/he3WozG3E
--
nathan