| From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
|---|---|
| To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
| Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: refactor architecture-specific popcount code |
| Date: | 2026-01-22 17:50:38 |
| Message-ID: | aXJjbkp2_glyfy6z@nathan |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:
> 1) Nowadays, the only global call sites of the word-sized functions
> are select_best_grantor() and in bitmapsets. The latter calls the
> word-sized functions in a loop (could be just one word). It may be
> more efficient to calculate the size in bytes and call pg_popcount().
Yeah, these seem like obvious places to use pg_popcount(). Note that
bms_member_index() does a final popcount on a masked version of the last
word. We could swap that with pg_popcount(), too, but it might be slower
than just calling the word-sized function. However, it could be hard to
tell the difference, as we'd be trading a function or function pointer call
with an inlined loop over pg_number_of_ones. And even if it is slower, I'm
not sure it matters all that much in the grand scheme of things.
In any case, 0001 gets the easy ones out of the way.
> Then we could get rid of all the pointer indirection for the
> word-sized functions.
Do you mean that we'd just keep the portable ones around? I see some code
in pgvector that might be negatively impacted by that, but if I understand
correctly it would require an unusual setup.
> 2) The x86 byte buffer variants expend a lot of effort to detect
> whether the buffer is aligned on both 64- and 32-bit platforms, with
> an optimized path for each. At least 64-bit doesn't care about
> alignment, and 32-bit doesn't warrant anything fancier than pure C.
> Simultaneously, the aarch64 equivalent doesn't seem to take care about
> alignment. (I think Nathan mentioned he didn't see a difference during
> testing, but I wonder how universal that is).
0002 makes these changes for pg_popcount_sse42() and
pg_popcount_masked_sse42(). It does seem strange to prefer a loop over
pg_number_of_ones instead of using POPCNTQ when unaligned, but perhaps it's
worth testing. I do recall the alignment stuff in the AVX-512 code showing
benefits in tests because it avoids double-load overhead, so I think we
should keep that for now.
> 3) There is repeated code for the <8 bytes case, and the tail of the
> "optimized" functions. I'm also not sure why the small case is inlined
> everywhere.
This is intended to help avoid function call and SIMD instruction overhead
when it doesn't make sense to take it. I recall this showing a rather big
difference in benchmarks when we were working on the AVX-512 versions.
Regarding the duplicated code, sure, we could add some static inline
functions or something. I think the only reason I haven't done so is
because it's ~2 lines of code.
--
nathan
| Attachment | Content-Type | Size |
|---|---|---|
| v3-0001-Make-use-of-pg_popcount-in-more-places.patch | text/plain | 2.1 KB |
| v3-0002-Remove-unnecessary-32-bit-optimizations-and-align.patch | text/plain | 2.5 KB |
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Noah Misch | 2026-01-22 17:42:50 | Re: Inval reliability, especially for inplace updates |