RE: Popcount optimization using AVX512

From: "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Matthias van de Meent" <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Popcount optimization using AVX512
Date: 2024-03-11 21:59:53
Message-ID: BL1PR11MB5304165E7A81E0107E70B069DC242@BL1PR11MB5304.namprd11.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
> Sent: Thursday, March 7, 2024 1:36 PM
> Subject: Re: Popcount optimization using AVX512

I will be splitting the request into 2 patches. I am attaching the first patch (refactoring only) and I updated the commitfest entry to match this patch. I have a question however:
Do I need to wait for the refactor patch to be merged before I post the AVX portion of this feature in this thread?

> > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
>
> I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my
> machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are
> other instructions required that need -mavx512f, then we might need to
> expand the test.

First, nice catch on the required flags to build! When I changed my algorithm, dependence on the -mavx512f flag was no longer needed, In the second patch (AVX specific) I will fix this.

> I still think it's worth breaking this change into at least 2 patches. In particular,
> I think there's an opportunity to do the refactoring into pg_popcnt_choose.c
> and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These
> changes are likely straightforward, and getting them out of the way early
> would make it easier to focus on the more interesting changes. IMHO there
> are a lot of moving parts in this patch.

As stated above I am doing this in 2 patches. :)

> > +#undef HAVE__GET_CPUID_COUNT
> > +
> > +/* Define to 1 if you have immintrin. */ #undef HAVE__IMMINTRIN
>
> Is this missing HAVE__CPUIDEX?

Yes I missed it, I will include in the second patch (AVX specific) of the 2 patches.

> > uint64
> > -pg_popcount(const char *buf, int bytes)
> > +pg_popcount_slow(const char *buf, int bytes)
> > {
> > uint64 popcnt = 0;
> >
> > -#if SIZEOF_VOID_P >= 8
> > +#if SIZEOF_VOID_P == 8
> > /* Process in 64-bit chunks if the buffer is aligned. */
> > if (buf == (const char *) TYPEALIGN(8, buf))
> > {
> > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
> >
> > buf = (const char *) words;
> > }
> > -#else
> > +#elif SIZEOF_VOID_P == 4
> > /* Process in 32-bit chunks if the buffer is aligned. */
> > if (buf == (const char *) TYPEALIGN(4, buf))
> > {
>
> Apologies for harping on this, but I'm still not seeing the need for these
> SIZEOF_VOID_P changes. While it's unlikely that this makes any practical
> difference, I see no reason to more strictly check SIZEOF_VOID_P here.

I got rid of the second occurrence as I agree it is not needed but unless you see something I don't how to know which function to call between a 32-bit and 64-bit architecture? Maybe I am missing something obvious? What exactly do you suggest here? I am happy to always call either pg_popcount32() or pg_popcount64() with the understanding that it may not be optimal, but I do need to know which to use.

> > + /* Process any remaining bytes */
> > + while (bytes--)
> > + popcnt += pg_number_of_ones[(unsigned char) *buf++];
> > + return popcnt;
> > +#else
> > + return pg_popcount_slow(buf, bytes);
> > +#endif /* USE_AVX512_CODE */
>
> nitpick: Could we call pg_popcount_slow() in a common section for these
> "remaining bytes?"

Agreed, will fix in the second patch as well.

> > +#if defined(_MSC_VER)
> > + pg_popcount_indirect = pg_popcount512_fast; #else
> > + pg_popcount = pg_popcount512_fast; #endif

> These _MSC_VER sections are interesting. I'm assuming this is the
> workaround for the MSVC linking issue you mentioned above. I haven't
> looked too closely, but I wonder if the CRC32C code (see
> src/include/port/pg_crc32c.h) is doing something different to avoid this issue.

Using the latest master branch, I see what the needed changes are, I will implement using PGDLLIMPORT macro in the second patch.

> Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned
> through this thread and didn't see any recent benchmark results for the latest
> form of the patch. I think it's worth verifying that we are still seeing the
> expected improvements.

I will get new benchmarks using the same process I used before (from Akash) so I get apples to apples. These are pending completion of the second patch which is still in progress.

Just a reminder, I asked questions above about 1) multi-part dependent patches and, 2) What specifically to do about the SIZE_VOID_P checks. :)

Thanks,
Paul

Attachment Content-Type Size
v7-0001-Refactor-POPCNT.patch application/octet-stream 7.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2024-03-11 22:38:45 Re: Combine Prune and Freeze records emitted by vacuum
Previous Message Bruce Momjian 2024-03-11 21:17:13 Re: Reports on obsolete Postgres versions