Popcount optimization using AVX512

From: "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject: Popcount optimization using AVX512
Date: 2023-11-02 14:22:10
Message-ID: BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A@BL1PR11MB5304.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share the preliminary results with the community and get feedback for adding avx512 support for popcount.
 
Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit scenarios relying on popcount.
 
My setup:
 
Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
OS : Ubuntu 22.04
GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".

1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
a. Software only and
b. SSE 64 bit version
2. I created an implementation using the following AVX512 intrinsics:
a. _mm512_popcnt_epi64()
b. _mm512_reduce_add_epi64()
3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed [std::mt19937_64])
4. I tested 5 seeds for each input buffer size and averaged 100 runs each (5*5*100=2500 pg_popcount() calls on a single thread)
5. Data: <See Attached picture.>

The code I wrote uses the 64-bit solution or SW on the memory not aligned to a 512-bit boundary in memory:
 
///////////////////////////////////////////////////////////////////////
// 512-bit intrisic implementation (AVX512VPOPCNTDQ + AVX512F)
uint64_t popcount_512_impl(const char *bytes, int byteCount) {
#ifdef __AVX__
uint64_t result = 0;
uint64_t remainder = ((uint64_t)bytes) % 64;
result += popcount_64_impl(bytes, remainder);
byteCount -= remainder;
bytes += remainder;
uint64_t vectorCount = byteCount / 64;
remainder = byteCount % 64;
__m512i *vectors = (__m512i *)bytes;
__m512i rv;
while (vectorCount--) {
rv = _mm512_popcnt_epi64(*(vectors++));
result += _mm512_reduce_add_epi64(rv);
}
bytes = (const char *)vectors;
result += popcount_64_impl(bytes, remainder);
return result;
#else
return popcount_64_impl(bytes, byteCount);
#endif
}
 
There are further optimizations that can be applied here, but for demonstration I added the __AVX__ macro and if not fall back to the original implementations in PostgreSQL.
 
The 46% improvement in popcount is worthy of discussion considering the previous popcount 64-bit SSE and SW implementations.
 
 Thanks,
Paul Amonson

Attachment Content-Type Size
AVX512 Popcount Benefits.png image/png 104.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-11-02 14:23:56 Re: Tab completion regression test failed on illumos
Previous Message John Naylor 2023-11-02 14:21:02 Re: Extract numeric filed in JSONB more effectively