Re: Improve CRC32C performance on SSE4.2

From:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
To:	John Naylor <johncnaylorls(at)gmail(dot)com>
Cc:	"Devulapalli, Raghuveer" <raghuveer(dot)devulapalli(at)intel(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>
Subject:	Re: Improve CRC32C performance on SSE4.2
Date:	2025-03-04 17:36:09
Message-ID:	Z8c6Cfp-XIiJtGB5@nathan
Lists:	pgsql-hackers

On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry about.
>
> Note also upthread I mentioned we may have to go to 512-bit pclmul,
> since Zen 2 regresses on 128-bit. :-(

Ah, okay. You mean the AVX-512 version [0]? And are you thinking we'd use
the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
SSE4.2 version for small inputs and use a function pointer for larger ones?
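
For concreteness, the strategy in question could be sketched roughly as follows. This is not taken from any posted patch: the threshold value and all names (CRC_INLINE_THRESHOLD, crc32c_small, crc32c_large, crc32c_update) are hypothetical, and a portable bitwise CRC-32C stands in for both the SSE4.2 path and the pclmul/AVX-512 path so that the sketch builds without special compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff; the thread does not settle on a value. */
#define CRC_INLINE_THRESHOLD 64

/*
 * Stand-in for the inlined SSE4.2 path: a portable bitwise CRC-32C
 * (reflected Castagnoli polynomial 0x82F63B78).
 */
static inline uint32_t
crc32c_small(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1)));
    }
    return crc;
}

/* Stand-in for a pclmul or AVX-512 routine that is not inlined. */
static uint32_t
crc32c_large_fallback(uint32_t crc, const void *data, size_t len)
{
    return crc32c_small(crc, data, len);
}

/* A real build would set this once at startup from a CPU feature check. */
static uint32_t (*crc32c_large) (uint32_t, const void *, size_t) =
    crc32c_large_fallback;

/* Small inputs stay inline; larger ones pay for one indirect call. */
static inline uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len < CRC_INLINE_THRESHOLD)
        return crc32c_small(crc, data, len);
    return crc32c_large(crc, data, len);
}

/* Convenience wrapper applying the usual initial value and final XOR. */
static uint32_t
crc32c(const void *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFU, data, len) ^ 0xFFFFFFFFU;
}
```

With this shape, callers passing short inputs never touch the function pointer, which is the property the question above is about.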

> I actually haven't seen any measurable difference with direct calls
> versus indirect, but it could very well be that the microbenchmark is
> hiding that since it's doing something unnatural by calling things a
> bunch of times in a loop. I want to try changing the benchmark to base
> the address it's computing on some bits from the crc from the last
> loop iteration. I think that would make it more latency-sensitive. We
> could also make it do an additional constant 20-byte input every time
> to make it resemble WAL more closely.

Looking back on some old benchmarks for small-ish inputs [1], the
difference does seem within the noise range. I suppose these functions
might be expensive enough to make the function pointer overhead negligible.
IME there's a big difference when a function pointer is used for an
instruction or two [2], but even relatively small inputs to the CRC-32C
functions might require several instructions.
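
As an illustration of the benchmark change John describes above, a loop along these lines would make each iteration's input address depend on the previous result and append a constant 20-byte input each time. This is only a sketch under assumptions: the names (crc_fn, bench_chained, HDR_LEN) are hypothetical, the existing microbenchmark may be structured quite differently, timing code is omitted, and a bitwise CRC-32C stands in for whichever implementation is being measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc_fn) (uint32_t crc, const void *data, size_t len);

/* Portable bitwise CRC-32C, used here only as the function under test. */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1)));
    }
    return crc;
}

/* Length of the extra constant input, per the suggestion above. */
#define HDR_LEN 20

/*
 * Each iteration picks its input offset from bits of the previous CRC,
 * so the next load address cannot be known until the previous
 * computation has finished.
 */
static uint32_t
bench_chained(crc_fn fn, const unsigned char *buf, size_t buflen,
              size_t len, int iters)
{
    static const unsigned char hdr[HDR_LEN] = {0};
    uint32_t    prev = 0;
    size_t      span;

    assert(len > 0 && len <= buflen);
    span = buflen - len + 1;    /* number of valid starting offsets */

    for (int i = 0; i < iters; i++)
    {
        size_t      off = prev % span;
        uint32_t    crc = fn(0xFFFFFFFFU, buf + off, len);

        prev = fn(crc, hdr, HDR_LEN);
    }
    return prev;
}
```

Returning the last value keeps the compiler from discarding the loop. Whether this actually exposes a direct-versus-indirect call difference would still have to be measured.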

>> The main question I have is whether we can simplify this by always using a
>> runtime check and by inlining slicing-by-8 for small inputs. That would be
>> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
>> for small inputs.
>
> Slicing-by-8 needs one lookup and one XOR per byte of input, and other
> overheads, so I think it would still be very slow.

That's my suspicion, too.
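
To make the per-byte cost concrete, here is a minimal slicing-by-8 sketch for CRC-32C. It is a generic textbook-style version written for this message, not PostgreSQL's implementation; the names (sb8_table, sb8_init, sb8_update) are hypothetical. Every 8-byte block costs eight table lookups and eight XORs (one XOR folds the running CRC into the input, seven combine the lookups), plus the work of assembling the two input words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t sb8_table[8][256];
static int  sb8_ready = 0;

/* Build the eight 256-entry tables for the reflected polynomial. */
static void
sb8_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t    crc = i;

        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1)));
        sb8_table[0][i] = crc;
    }
    for (uint32_t i = 0; i < 256; i++)
        for (int t = 1; t < 8; t++)
            sb8_table[t][i] = (sb8_table[t - 1][i] >> 8) ^
                sb8_table[0][sb8_table[t - 1][i] & 0xFF];
    sb8_ready = 1;
}

static uint32_t
sb8_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(sb8_ready);
    while (len >= 8)
    {
        /* Assemble words byte by byte so host endianness doesn't matter. */
        uint32_t    lo = crc ^ ((uint32_t) p[0] | (uint32_t) p[1] << 8 |
                                (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24);
        uint32_t    hi = (uint32_t) p[4] | (uint32_t) p[5] << 8 |
            (uint32_t) p[6] << 16 | (uint32_t) p[7] << 24;

        /* One lookup per input byte, all XORed together. */
        crc = sb8_table[7][lo & 0xFF] ^ sb8_table[6][(lo >> 8) & 0xFF] ^
            sb8_table[5][(lo >> 16) & 0xFF] ^ sb8_table[4][lo >> 24] ^
            sb8_table[3][hi & 0xFF] ^ sb8_table[2][(hi >> 8) & 0xFF] ^
            sb8_table[1][(hi >> 16) & 0xFF] ^ sb8_table[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    /* Remaining bytes fall back to the plain one-table loop. */
    while (len--)
        crc = (crc >> 8) ^ sb8_table[0][(crc ^ *p++) & 0xFF];
    return crc;
}
```

By comparison, the SSE4.2 crc32 instruction consumes 8 bytes in a single operation, so a table-driven inline path for small inputs looks unlikely to be competitive.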

[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com
[1] https://postgr.es/m/20231031033601.GA68409%40nathanxps13
[2] https://postgr.es/m/CAApHDvqyMNGVgwpaOPtENdq5uEMR%3DvSkRJEgG1S%2BX7Vtk1-EnA%40mail.gmail.com

--
nathan
