Re: add AVX2 support to simd.h

From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, John Naylor <johncnaylorls(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: add AVX2 support to simd.h
Date: 2024-01-09 15:25:42
Message-ID: CANwKhkMvEr+EgRCX5eV39cdCBw_ArcevM0hmJqq-1dNgpAB0cg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> > writers  sse2  avx2    %
> >     256  1195  1188   -1
> >     512   928  1054  +14
> >    1024   633   716  +13
> >    2048   332   420  +27
> >    4096   162   203  +25
> >    8192   162   182  +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality. One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me. Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?

The AVX2 implementation uses a wider vector register. On most current
processors the throughput of the instructions in question is the same
on 256-bit vectors as on 128-bit vectors, so each instruction processes
twice as many elements. Basically, the chip has AVX2 worth of machinery
and using SSE2 leaves half of it unused. Notable exceptions are
efficiency cores on recent Intel desktop CPUs and AMD CPUs pre-Zen 2,
where AVX2 instructions are internally split up into two 128-bit wide
instructions.
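
A rough sketch of what the width difference looks like in code. This is
not taken from the patch; the function names (contains_scalar,
contains_sse2, contains_avx2, contains_u32) are hypothetical, and it
assumes a GCC/Clang style compiler that defines __SSE2__ and __AVX2__.
The two vector loops are the same compare-and-test sequence, but the
256-bit one covers 8 elements per iteration instead of 4.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>          /* also provides the SSE2 intrinsics */
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference; also handles the tail that does not fill a vector. */
static inline bool
contains_scalar(const uint32_t *base, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++)
        if (base[i] == key)
            return true;
    return false;
}

#if defined(__SSE2__)
/* 128-bit registers: one compare covers 4 elements. */
static inline bool
contains_sse2(const uint32_t *base, size_t n, uint32_t key)
{
    const __m128i keys = _mm_set1_epi32((int) key);
    size_t      i = 0;

    for (; i + 4 <= n; i += 4)
    {
        __m128i     v = _mm_loadu_si128((const __m128i *) (base + i));

        if (_mm_movemask_epi8(_mm_cmpeq_epi32(v, keys)) != 0)
            return true;
    }
    return contains_scalar(base + i, n - i, key);
}
#endif

#if defined(__AVX2__)
/* 256-bit registers: the same compare covers 8 elements. */
static inline bool
contains_avx2(const uint32_t *base, size_t n, uint32_t key)
{
    const __m256i keys = _mm256_set1_epi32((int) key);
    size_t      i = 0;

    for (; i + 8 <= n; i += 8)
    {
        __m256i     v = _mm256_loadu_si256((const __m256i *) (base + i));

        if (_mm256_movemask_epi8(_mm256_cmpeq_epi32(v, keys)) != 0)
            return true;
    }
    return contains_scalar(base + i, n - i, key);
}
#endif

/* Hypothetical entry point: use the widest variant enabled at compile time. */
bool
contains_u32(const uint32_t *base, size_t n, uint32_t key)
{
#if defined(__AVX2__)
    return contains_avx2(base, n, key);
#elif defined(__SSE2__)
    return contains_sse2(base, n, key);
#else
    return contains_scalar(base, n, key);
#endif
}
```

Building with -mavx2 selects the 256-bit loop; without it, an x86-64
build falls back to the SSE2 loop, and other targets use the scalar
path.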

For AVX-512 the picture is much more complicated. Some instructions run
at half rate, some run at full rate but not on all ALU ports, and some
cause aggressive clock rate reduction on some microarchitectures.
AVX-512 also adds mask registers and masked vector instructions that
enable considerably simpler code in many cases. Interestingly, I have
seen Clang make quite effective use of these masked instructions even
when the code uses AVX2 intrinsics but targets an AVX-512 capable
platform.
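
One case where masks simplify things is the loop tail. The following is
only a sketch under assumptions: the function name contains_u32_masked
is hypothetical, the AVX-512 path is compiled only when the compiler
defines __AVX512F__, and a scalar loop stands in otherwise. A masked
load reads just the live lanes, so the last partial vector needs no
separate scalar cleanup loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: is key present in base[0..n)? */
bool
contains_u32_masked(const uint32_t *base, size_t n, uint32_t key)
{
#if defined(__AVX512F__)
    const __m512i keys = _mm512_set1_epi32((int) key);

    for (size_t i = 0; i < n; i += 16)
    {
        size_t      left = n - i;   /* at least 1 inside the loop */

        /* All 16 lanes live, or only the low 'left' lanes on the tail. */
        __mmask16   live = left >= 16
            ? (__mmask16) 0xFFFF
            : (__mmask16) ((1u << left) - 1u);

        /* Masked-off lanes are not read from memory and load as zero. */
        __m512i     v = _mm512_maskz_loadu_epi32(live, base + i);

        /* Compare only the live lanes; the result is a 16-bit lane mask. */
        if (_mm512_mask_cmpeq_epi32_mask(live, v, keys) != 0)
            return true;
    }
    return false;
#else
    /* Fallback for builds without AVX-512F. */
    for (size_t i = 0; i < n; i++)
        if (base[i] == key)
            return true;
    return false;
#endif
}
```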

The vector-width-independent approach used in the patch is nice for
simple cases because it does not need a separate implementation for
each vector width. However, for more complicated cases where
"horizontal" operations are needed it's going to be much less useful.
But these cases can easily just drop down to using intrinsics directly.
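
To illustrate the difference, here is a minimal sketch, not the patch's
code. The names (vec32, vec_zero, vec_load, vec_add, fold_to_128,
vec_hsum, sum_u32) are hypothetical and are not the identifiers used in
simd.h. The element-wise part of the loop is written once against a
small set of width-neutral macros, while the horizontal sum at the end
needs width-specific intrinsics, because the 256-bit case first has to
fold its upper half onto the lower one.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#define LANES 8
typedef __m256i vec32;
#define vec_zero()     _mm256_setzero_si256()
#define vec_load(p)    _mm256_loadu_si256((const __m256i *) (p))
#define vec_add(a, b)  _mm256_add_epi32((a), (b))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define LANES 4
typedef __m128i vec32;
#define vec_zero()     _mm_setzero_si128()
#define vec_load(p)    _mm_loadu_si128((const __m128i *) (p))
#define vec_add(a, b)  _mm_add_epi32((a), (b))
#endif

#ifdef LANES
/* Width-specific step: reduce the accumulator to a single 128-bit vector. */
static inline __m128i
fold_to_128(vec32 v)
{
#if defined(__AVX2__)
    return _mm_add_epi32(_mm256_castsi256_si128(v),
                         _mm256_extracti128_si256(v, 1));
#else
    return v;
#endif
}

/* Horizontal sum of all 32-bit lanes (wraps modulo 2^32). */
static inline uint32_t
vec_hsum(vec32 v)
{
    __m128i     s = fold_to_128(v);

    /* Add the upper 64 bits onto the lower 64 bits. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Add lane 1 onto lane 0. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t) _mm_cvtsi128_si32(s);
}
#endif

/* Sum of a[0..n), modulo 2^32. */
uint32_t
sum_u32(const uint32_t *a, size_t n)
{
    uint32_t    total = 0;
    size_t      i = 0;

#ifdef LANES
    vec32       acc = vec_zero();

    /* Element-wise part: identical source for either vector width. */
    for (; i + LANES <= n; i += LANES)
        acc = vec_add(acc, vec_load(a + i));
    total = vec_hsum(acc);
#endif
    /* Scalar tail, and the whole loop when no vector support is enabled. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Only fold_to_128 differs between the two widths here, but every further
horizontal operation (min, max, prefix sums, compaction) would add
another width-specific branch of this kind.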
