RE: Popcount optimization using AVX512

From: "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Matthias van de Meent" <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Popcount optimization using AVX512
Date: 2024-02-12 20:14:06
Message-ID: BL1PR11MB5304F65FD8D19B608F9E8624DC482@BL1PR11MB5304.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

My responses with questions,

> > +# XXX: The configure.ac check for __cpuidex() is broken, we don't
> > +copy that # here. To prevent problems due to two detection methods
> > +working, stop # checking after one.
>
> This seems like a bogus copy-paste.

My bad. Will remove the offending comment. :)

> > +# Check for header immintrin.h
> ...
> Do these all actually have to link? Invoking the linker is slow.
> I think you might be able to just use cc.has_header_symbol().

I took this to mean the last of the 3 new blocks. I changed this one to the cc_has_header method. I think I do want the first 2 checking the link as well. If the don't link here they won't link in the actual build.

> Does this work with msvc?

I think it will work but I have no way to validate it. I propose we remove the AVX-512 popcount feature from MSVC builds. Sound ok?

> That's a very long line in the output, how about using the avx feature name or something?

Agree, will fix.

> This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject automatic uses of avx512 in places, which would cause problems, no?

This will take me some time to learn how to do this in meson. Any pointers here would be helpful.

> While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly that reson? That's why the existing code builds the files for various crc variants as their own file.

I don't think its broken, nothing else in pg_bitutils.c will make use of AVX-512, so I am not sure what dividing this up into multiple files will yield benefits beyond code readability as they will all be needed during compile time. I prefer to not split if the community agrees to it.

If splitting still makes sense, I propose splitting into 3 files: pg_bitutils.c (entry point +sw popcnt implementation), pg_popcnt_choose.c (CPUID and xgetbv check) and pg_popcnt_x86_64_accel.c (64/512bit x86 implementations).
I'm not an expert in meson, but splitting might add complexity to meson.build.

Could you elaborate if there are other benefits to the split file approach?

Paul

-----Original Message-----
From: Andres Freund <andres(at)anarazel(dot)de>
Sent: Friday, February 9, 2024 10:35 AM
To: Amonson, Paul D <paul(dot)d(dot)amonson(at)intel(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>; Shankaran, Akash <akash(dot)shankaran(at)intel(dot)com>; Nathan Bossart <nathandbossart(at)gmail(dot)com>; Noah Misch <noah(at)leadboat(dot)com>; Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>; pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Popcount optimization using AVX512

Hi,

On 2024-02-09 17:39:46 +0000, Amonson, Paul D wrote:

> diff --git a/meson.build b/meson.build index 8ed51b6aae..1e7a4dc942
> 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1773,6 +1773,45 @@ elif cc.links('''
> endif
>
>
> +# XXX: The configure.ac check for __cpuidex() is broken, we don't
> +copy that # here. To prevent problems due to two detection methods
> +working, stop # checking after one.

This seems like a bogus copy-paste.

> +if cc.links('''
> + #include <cpuid.h>
> + int main(int arg, char **argv)
> + {
> + unsigned int exx[4] = {0, 0, 0, 0};
> + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
> + }
> + ''', name: '__get_cpuid_count',
> + args: test_c_args)
> + cdata.set('HAVE__GET_CPUID_COUNT', 1) elif cc.links('''
> + #include <intrin.h>
> + int main(int arg, char **argv)
> + {
> + unsigned int exx[4] = {0, 0, 0, 0};
> + __cpuidex(exx, 7, 0);
> + }
> + ''', name: '__cpuidex',
> + args: test_c_args)
> + cdata.set('HAVE__CPUIDEX', 1)
> +endif
> +
> +
> +# Check for header immintrin.h
> +if cc.links('''
> + #include <immintrin.h>
> + int main(int arg, char **argv)
> + {
> + return 1701;
> + }
> + ''', name: '__immintrin',
> + args: test_c_args)
> + cdata.set('HAVE__IMMINTRIN', 1)
> +endif

Do these all actually have to link? Invoking the linker is slow.

I think you might be able to just use cc.has_header_symbol().

> +###############################################################
> +# AVX 512 POPCNT Intrinsic check
> +###############################################################
> +have_avx512_popcnt = false
> +cflags_avx512_popcnt = []
> +if host_cpu == 'x86_64'
> + prog = '''
> + #include <immintrin.h>
> + #include <stdint.h>
> + void main(void)
> + {
> + __m512i tmp __attribute__((aligned(64)));
> + __m512i input = _mm512_setzero_si512();
> + __m512i output = _mm512_popcnt_epi64(input);
> + uint64_t cnt = 999;
> + _mm512_store_si512(&tmp, output);
> + cnt = _mm512_reduce_add_epi64(tmp);
> + /* return computed value, to prevent the above being optimized away */
> + return cnt == 0;
> + }'''

Does this work with msvc?

> + if cc.links(prog, name: '_mm512_setzero_si512,
> + _mm512_popcnt_epi64, _mm512_store_si512, and _mm512_reduce_add_epi64
> + with -mavx512vpopcntdq -mavx512f',

That's a very long line in the output, how about using the avx feature name or something?

> diff --git a/src/port/Makefile b/src/port/Makefile index
> dcc8737e68..6a01a7d89a 100644
> --- a/src/port/Makefile
> +++ b/src/port/Makefile
> @@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
> pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
> pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
>
> +# Newer Intel processors can use AVX-512 POPCNT Capabilities
> +(01/30/2024)
> +pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
> +pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
> +pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
> +
> # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
> pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
> pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC) diff --git
> a/src/port/meson.build b/src/port/meson.build index
> 69b30ab21b..1c48a3b07e 100644
> --- a/src/port/meson.build
> +++ b/src/port/meson.build
> @@ -184,6 +184,7 @@ foreach name, opts : pgport_variants
> link_with: cflag_libs,
> c_pch: pch_c_h,
> kwargs: opts + {
> + 'c_args': opts.get('c_args', []) + cflags_avx512_popcnt,
> 'dependencies': opts['dependencies'] + [ssl],
> }
> )

This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject automatic uses of avx512 in places, which would cause problems, no?

While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly that reson? That's why the existing code builds the files for various crc variants as their own file.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2024-02-12 20:17:32 Why does BitmapPrefetch() skip fetch based on current block recheck flag
Previous Message Alvaro Herrera 2024-02-12 20:11:33 Re: backend *.c #include cleanup (IWYU)