Re: Popcount optimization using AVX512

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: "Amonson, Paul D" <paul(dot)d(dot)amonson(at)intel(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Popcount optimization using AVX512
Date: 2024-03-17 20:56:32
Message-ID: CAApHDvrb7MJRB6JuKLDEY2x_LKdFHwVbogpjZBCX547i5+rXOQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, 16 Mar 2024 at 04:06, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> I ran John Naylor's test_popcount module [0] with the following command on
> an i7-1195G7:
>
> time psql postgres -c 'select drive_popcount(10000000, 1024)'
>
> Without your patches, this seems to take somewhere around 8.8 seconds.
> With your patches, it takes 0.6 seconds. (I re-compiled and re-ran the
> tests a couple of times because I had a difficult time believing the amount
> of improvement.)
>
> [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com

I think most of that will come from getting rid of the indirect
function call that currently exists in pg_popcount().
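
To illustrate the cost being described, here is a minimal sketch of the
two loop shapes. It is not the attached patch and not the code in
pg_bitutils.c; every name in it (popcount64_portable, popcount64_ptr,
popcount_buf_indirect, popcount_buf_direct) is hypothetical. The first
loop goes through a function pointer for every 8-byte word, which the
compiler cannot inline. The second calls the word-level routine
directly, so the whole loop can be inlined and optimized as one unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable per-word popcount (hypothetical helper): clears the lowest
 * set bit on each iteration and counts the iterations. */
static inline int
popcount64_portable(uint64_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Function pointer that would be set once, e.g. after a CPU feature
 * check (assumed setup; here it simply points at the portable version). */
int (*popcount64_ptr) (uint64_t) = popcount64_portable;

/* Shape 1: one indirect call per 8-byte word, plus one per tail byte. */
uint64_t
popcount_buf_indirect(const char *buf, size_t bytes)
{
    uint64_t total = 0;

    assert(buf != NULL || bytes == 0);
    while (bytes >= 8)
    {
        uint64_t word;

        memcpy(&word, buf, 8);      /* avoids unaligned access */
        total += popcount64_ptr(word);
        buf += 8;
        bytes -= 8;
    }
    while (bytes-- > 0)
        total += popcount64_ptr((unsigned char) *buf++);
    return total;
}

/* Shape 2: same loop, but the callee is known at compile time, so the
 * compiler is free to inline it into the loop body. */
uint64_t
popcount_buf_direct(const char *buf, size_t bytes)
{
    uint64_t total = 0;

    assert(buf != NULL || bytes == 0);
    while (bytes >= 8)
    {
        uint64_t word;

        memcpy(&word, buf, 8);
        total += popcount64_portable(word);
        buf += 8;
        bytes -= 8;
    }
    while (bytes-- > 0)
        total += popcount64_portable((unsigned char) *buf++);
    return total;
}
```

With runtime CPU detection, the usual way to keep the second shape is
to pick a whole-buffer routine once per call (one indirect call per
buffer) rather than a per-word routine inside the loop.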

Using the attached quick hack, the performance using John's test
module goes from:

-- master
postgres=# select drive_popcount(10000000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)

-- with attached hacky and untested patch
postgres=# select drive_popcount(10000000, 1024);
Time: 2539.029 ms (00:02.539)
Time: 2598.223 ms (00:02.598)
Time: 2611.435 ms (00:02.611)

-- and with the avx512 patch on an AMD 7945HX CPU:
postgres=# select drive_popcount(10000000, 1024);
Time: 564.982 ms
Time: 556.540 ms
Time: 554.032 ms

The following comment seems like it could do with some improvements.

* Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software
* loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast
* methods are used this will fall back to __builtin_* or pure software.

There's nothing much specific to Intel here. AMD Zen4 has AVX512.
Plus "pg_bunutils.c" should be "pg_bitutils.c" and "the the" should
be "to the".

How about just:

* Use AVX-512 Intrinsics on supported CPUs. Fall back to the software loop in
* pg_popcount_slow() when AVX-512 is unavailable.

Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff. It seems like a pretty good speed-up
and will also apply to CPUs without AVX512 support.

David

Attachment Content-Type Size
remove_indirect_func_call_in_pg_popcount.patch.txt text/plain 928 bytes
