RE: Popcount optimization using AVX512

From: "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Popcount optimization using AVX512
Date: 2023-11-15 20:27:57
Message-ID: PH0PR11MB5000EFC19DD2C07F09871161F2B1A@PH0PR11MB5000.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Sorry for the late response here. We spent some time researching and measuring the frequency impact of AVX512 instructions used here.

>How does this compare to older CPUs, and more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be included in)
has significant implications for core clock frequency when those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

AVX512 has light and heavy instructions. While the heavy AVX512 instructions have clock frequency implications, the light instructions not so much. See [0] for more details. We captured EMON data for the benchmark used in this work, and see that the instructions are using the licensing level not meant for heavy AVX512 operations. This means the instructions for popcount : _mm512_popcnt_epi64(), _mm512_reduce_add_epi64() are not going to have any significant impact on CPU clock frequency.
Clock frequency impact aside, we measured the same benchmark for gains on older Intel hardware and observe up to 18% better performance on Intel Icelake. On older intel hardware, the popcntdq 512 instruction is not present so it won’t work. If clock frequency is not affected, rest of workload should not be impacted in the case of mixed workloads.

>Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

Testing this on smaller block sizes < 8KiB shows that AVX512 compared to the current 64bit behavior shows slightly lower performance, but with a large variance. We cannot conclude much from it. The testing with ANALYZE benchmark by Nathan also points to no visible impact as a result of using AVX512. The gains on larger dataset is easily evident, with less variance.
What are your thoughts if we introduce AVX512 popcount for smaller sizes as an optional feature initially, and then test it more thoroughly over time on this particular use case?

Regarding enablement, following the other responses related to function inlining, using ifunc and enabling future intrinsic support, it seems a concrete solution would require further discussion. We’re attaching a patch to enable AVX512, which can use AVX512 flags during build. For example:
>make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl -march=icelake-server -DAVX512_POPCNT=1"

Thoughts or feedback on the approach in the patch? This solution should not impact anyone who doesn’t use the feature i.e. AVX512. Open to additional ideas if this doesn’t seem like the right approach here.

[0] https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

-----Original Message-----
From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Sent: Tuesday, November 7, 2023 12:15 PM
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>; Amonson, Paul D <paul(dot)d(dot)amonson(at)intel(dot)com>; pgsql-hackers(at)lists(dot)postgresql(dot)org; Shankaran, Akash <akash(dot)shankaran(at)intel(dot)com>
Subject: Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:
> On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
>> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
>> > The glibc/gcc "ifunc" mechanism was designed to solve this problem
>> > of choosing a function implementation based on the runtime CPU,
>> > without incurring function pointer overhead. I would not attempt
>> > to use AVX512 on non-glibc systems, and I would use ifunc to select the desired popcount implementation on glibc:
>> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.ht
>> > ml
>>
>> Thanks, that seems promising for the function pointer cases. I'll
>> plan on trying to convert one of the existing ones to use it. BTW it
>> looks like LLVM has something similar [0].
>>
>> IIUC this unfortunately wouldn't help for cases where we wanted to
>> keep stuff inlined, such as is_valid_ascii() and the functions in
>> pg_lfind.h, unless we applied it to the calling functions, but that
>> doesn't ѕound particularly maintainable.
>
> Agreed, it doesn't solve inline cases. If the gains are big enough,
> we should move toward packages containing N CPU-specialized copies of
> the postgres binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses the runtime checks for the CRC32C code, and I actually see a consistent 3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting. That's probably not something that could be enabled by default, but perhaps we could add support for a build option.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
proposed_popcnt.patch application/octet-stream 4.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-11-15 20:29:16 Re: Allow tests to pass in OpenSSL FIPS mode
Previous Message Andres Freund 2023-11-15 20:21:33 Re: Some performance degradation in REL_16 vs REL_15