From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | "Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com> |
Subject: | Re: Improve CRC32C performance on SSE4.2 |
Date: | 2025-03-04 17:36:09 |
Message-ID: | Z8c6Cfp-XIiJtGB5@nathan |
Lists: | pgsql-hackers |
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry about.
>
> Note also upthread I mentioned we may have to go to 512-bit pclmul,
> since Zen 2 regresses on 128-bit. :-(

Ah, okay.  You mean the AVX-512 version [0]?  And are you thinking we'd use
the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
SSE4.2 version for small inputs and use a function pointer for larger ones?

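To make the question concrete, here is a rough sketch of that strategy. It is not taken from any patch in this thread: the names (crc32c_small_inline, crc32c_large, CRC_INLINE_THRESHOLD) and the 64-byte cutoff are made up for illustration, and a portable bitwise loop stands in for the SSE4.2 instructions so that the example compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff; a real value would have to come from benchmarking. */
#define CRC_INLINE_THRESHOLD 64

/*
 * Stand-in for the inlined SSE4.2 path.  A real build would use the crc32
 * instruction here; this bitwise loop only keeps the sketch portable.
 */
static inline uint32_t
crc32c_small_inline(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1)));
	}
	return crc;
}

/* Default target of the function pointer (same algorithm, not inlined). */
static uint32_t
crc32c_large_fallback(uint32_t crc, const void *data, size_t len)
{
	return crc32c_small_inline(crc, data, len);
}

/*
 * Assumed to be set once at startup after a runtime CPU check, e.g. to a
 * PCLMUL or AVX-512 implementation when one is available.
 */
static uint32_t (*crc32c_large) (uint32_t, const void *, size_t) =
	crc32c_large_fallback;

/* Small inputs stay inline; larger ones pay for one indirect call. */
static inline uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
	if (len < CRC_INLINE_THRESHOLD)
		return crc32c_small_inline(crc, data, len);

	assert(crc32c_large != NULL);
	return crc32c_large(crc, data, len);
}

/* Convenience wrapper applying the usual initial value and final XOR. */
static uint32_t
crc32c(const void *data, size_t len)
{
	return crc32c_update(0xFFFFFFFFU, data, len) ^ 0xFFFFFFFFU;
}
```

With this shape, only inputs at or above the cutoff go through the pointer, so the indirect-call cost is spread over more bytes.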
> I actually haven't seen any measurable difference with direct calls
> versus indirect, but it could very well be that the microbenchmark is
> hiding that since it's doing something unnatural by calling things a
> bunch of times in a loop. I want to try changing the benchmark to base
> the address it's computing on some bits from the crc from the last
> loop iteration. I think that would make it more latency-sensitive. We
> could also make it do an additional constant 20-byte input every time
> to make it resemble WAL more closely.

Looking back on some old benchmarks for small-ish inputs [1], the
difference does seem within the noise range.  I suppose these functions
might be expensive enough to make the function pointer overhead negligible.
IME there's a big difference when a function pointer is used for an
instruction or two [2], but even relatively small inputs to the CRC-32C
functions might require several instructions.

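For reference, the benchmark change John describes in the quoted text could look roughly like the loop below. This is only a sketch under my own assumptions: bench_dependent_crc, fixed_header, and crc32c_bitwise are hypothetical names, the bitwise CRC is a portable stand-in for whichever implementation is being measured, and the timing harness is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the CRC-32C implementation under test. */
static uint32_t
crc32c_bitwise(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1)));
	}
	return crc;
}

/* The constant 20-byte input suggested in the quoted text. */
static const unsigned char fixed_header[20] = {0};

/*
 * Each iteration picks its input address from the previous CRC, so the
 * next call cannot start until the last one has finished.  That should
 * make the loop sensitive to latency rather than only to throughput.
 */
static uint32_t
bench_dependent_crc(const unsigned char *buf, size_t buflen,
					size_t len, int iterations)
{
	uint32_t	crc = 0xFFFFFFFFU;
	size_t		span;

	assert(len > 0 && len <= buflen);
	span = buflen - len + 1;	/* number of valid start offsets */

	for (int i = 0; i < iterations; i++)
	{
		size_t		offset = crc % span;	/* depends on the last result */

		crc = crc32c_bitwise(crc, fixed_header, sizeof(fixed_header));
		crc = crc32c_bitwise(crc, buf + offset, len);
	}

	/* Returning the value keeps the compiler from discarding the work. */
	return crc;
}
```

Timing this loop with direct and indirect calls swapped in for crc32c_bitwise would be the comparison of interest.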
>> The main question I have is whether we can simplify this by always using a
>> runtime check and by inlining slicing-by-8 for small inputs. That would be
>> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
>> for small inputs.
>
> Slicing-by-8 needs one lookup and one XOR per byte of input, and other
> overheads, so I think it would still be very slow.

That's my suspicion, too.

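To illustrate the per-byte cost John mentions, here is a minimal, generic slicing-by-8 sketch for CRC-32C. It is not PostgreSQL's implementation; the names (sb8_table, sb8_init, sb8_update) are hypothetical, and the bytes are loaded one at a time so the code does not depend on endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t sb8_table[8][256];
static int	sb8_ready = 0;

/* Build the eight 256-entry tables from the reflected polynomial. */
static void
sb8_init(void)
{
	for (uint32_t n = 0; n < 256; n++)
	{
		uint32_t	c = n;

		for (int k = 0; k < 8; k++)
			c = (c >> 1) ^ (0x82F63B78U & (0U - (c & 1)));
		sb8_table[0][n] = c;
	}
	for (uint32_t n = 0; n < 256; n++)
		for (int t = 1; t < 8; t++)
			sb8_table[t][n] = (sb8_table[t - 1][n] >> 8) ^
				sb8_table[0][sb8_table[t - 1][n] & 0xFF];
	sb8_ready = 1;
}

/* Assemble four bytes into a little-endian 32-bit word. */
static uint32_t
load_le32(const unsigned char *p)
{
	return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
		((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

static uint32_t
sb8_update(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);
	if (!sb8_ready)
		sb8_init();

	/* Eight bytes per step: eight table lookups and eight XORs. */
	while (len >= 8)
	{
		uint32_t	lo = crc ^ load_le32(p);
		uint32_t	hi = load_le32(p + 4);

		crc = sb8_table[7][lo & 0xFF] ^ sb8_table[6][(lo >> 8) & 0xFF] ^
			sb8_table[5][(lo >> 16) & 0xFF] ^ sb8_table[4][lo >> 24] ^
			sb8_table[3][hi & 0xFF] ^ sb8_table[2][(hi >> 8) & 0xFF] ^
			sb8_table[1][(hi >> 16) & 0xFF] ^ sb8_table[0][hi >> 24];
		p += 8;
		len -= 8;
	}

	/* Remaining bytes: one lookup and one XOR each, plus a shift. */
	while (len--)
		crc = sb8_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);

	return crc;
}
```

Even in the 8-byte loop, every input byte still costs a table load and an XOR, whereas the SSE4.2 crc32 instruction consumes up to eight bytes per instruction.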
[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com
[1] https://postgr.es/m/20231031033601.GA68409%40nathanxps13
[2] https://postgr.es/m/CAApHDvqyMNGVgwpaOPtENdq5uEMR%3DvSkRJEgG1S%2BX7Vtk1-EnA%40mail.gmail.com
--
nathan