Re: add AVX2 support to simd.h

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: add AVX2 support to simd.h
Date: 2024-01-10 02:06:08
Message-ID: CANWCAZbad=aZaMaJD4pphrYNQLsjBUHN5oJ=9zvb5abtOn9QSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 9, 2024 at 11:20 PM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea. If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors. But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this. Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

If we have say 25 elements, I mean (for SSE2) check the first 16, then
the last 16. Some will be checked twice, but that's okay.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-01-10 02:38:17 Re: Make NUM_XLOGINSERT_LOCKS configurable
Previous Message Masahiko Sawada 2024-01-10 02:05:06 Re: [PoC] Improve dead tuple storage for lazy vacuum