Re: Popcount optimization using AVX512

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Popcount optimization using AVX512
Date: 2024-04-04 17:18:28
Message-ID: 20240404171828.GA3866970@nathanxps13
Lists: pgsql-hackers

On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
>
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?

Good point. When compared with REL_16_STABLE, Ants's idea still wins:

 bytes       v25   v25+ants   REL_16_STABLE
     2  1108.205   1033.132        2039.342
     4  1311.227   1289.373        3207.217
     8  1927.954   2360.113        3200.238
    16  2281.091   2365.408        4457.769
    32  3856.992   2390.688        6206.689
    64   3648.72   3242.498        9619.403
   128  4108.549   3607.148       17912.081
   256  4910.076   4496.852       33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call. For 8 bytes, REL_16_STABLE is
doing a function call as well as a call to a function pointer. At 16
bytes, it's doing a function call and two calls to a function pointer.
With Ants's approach, both 8 and 16 bytes require a single call to a
function pointer, and of course we are using the AVX-512 implementation for
both.
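
As a rough sketch of the call structure described above (not the actual
patch), the following shows a small inlined path plus a single call through
a function pointer for larger inputs. All names here (popcount_bytes,
popcount_impl, popcount_wide, popcount_choose) are hypothetical, the 8-byte
threshold is an assumption taken from the byte counts discussed above, and
a portable loop using the GCC/Clang builtin __builtin_popcountll stands in
for the AVX-512 implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is not PostgreSQL's actual code. */
typedef uint64_t (*popcount_fn) (const unsigned char *buf, size_t bytes);

static uint64_t popcount_choose(const unsigned char *buf, size_t bytes);

/* Starts out pointing at the chooser, which replaces it on first use. */
static popcount_fn popcount_impl = popcount_choose;

/* Portable stand-in for a wide (e.g. AVX-512) implementation. */
static uint64_t
popcount_wide(const unsigned char *buf, size_t bytes)
{
    uint64_t    count = 0;

    while (bytes >= sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));   /* safe for unaligned input */
        count += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
        bytes -= sizeof(word);
    }

    /* Handle any trailing bytes one at a time. */
    while (bytes-- > 0)
        count += (uint64_t) __builtin_popcount(*buf++);

    return count;
}

/* First call lands here; a real build would test CPU features first. */
static uint64_t
popcount_choose(const unsigned char *buf, size_t bytes)
{
    popcount_impl = popcount_wide;
    return popcount_impl(buf, bytes);
}

/*
 * Inlined entry point: tiny inputs are counted in place with no call at
 * all, everything else costs exactly one call through the pointer.
 * The threshold of 8 is an assumption for illustration.
 */
static inline uint64_t
popcount_bytes(const unsigned char *buf, size_t bytes)
{
    if (bytes < 8)
    {
        uint64_t    count = 0;

        while (bytes-- > 0)
            count += (uint64_t) __builtin_popcount(*buf++);
        return count;
    }

    return popcount_impl(buf, bytes);
}
```

With this shape, an 8-byte and a 16-byte input each take one indirect call,
whereas a design that calls a per-word function pointer from inside a
non-inlined function would take two and three calls respectively.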

I think this is sufficient to justify switching approaches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
