From: | Eduard Stefes <Eduard(dot)Stefes(at)ibm(dot)com> |
---|---|
To: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Cc: | "iii(at)linux(dot)ibm(dot)com" <iii(at)linux(dot)ibm(dot)com>, Eduard Stefes <Eduard(dot)Stefes(at)ibm(dot)com>, "rueckner(at)linux(dot)ibm(dot)com" <rueckner(at)linux(dot)ibm(dot)com> |
Subject: | Review/Pull Request: Adding new CRC32C implementation for IBM S390X |
Date: | 2025-05-07 10:37:50 |
Message-ID: | 918d9941377f6e83fbfebe96ba496ccaefa3803f.camel@ibm.com |
Lists: | pgsql-hackers |
Hi,
Attached is a patch that adds a vectorized implementation of CRC32C for
IBM S390X hardware. I kindly ask for a review of the code, with the goal
of getting it picked up in upstream postgresql.
# Why this patch:
We noticed that postgres running on S390X spends much longer in
CRC32C than on other platforms that have an optimized crc32c. Hendrik
Brueckner has kindly offered to let us re-license his implementation of
an optimized crc32c under the postgres license. This implementation is
already used in gzip, zlib-ng and the linux kernel.
# The optimized CRC32C:
The IBM S390X platform has no dedicated CRC infrastructure. The
algorithm works by using >>reduction constants to fold and process
particular chunks of the input data stream in parallel.<< This makes
great use of the S390X vector units. Depending on the size of the input
stream, a speedup of about an order of magnitude can be achieved
(compared to sb8).
# Runtime checks:
The runtime detection strategy follows the same approach as the ARM
code. If the code is already compiled with all needed flags enabled,
the runtime detection will not be compiled in. If instead the build
system enables the needed flags itself, it will also enable runtime
detection.
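As a rough illustration of that selection, here is a minimal sketch of the preprocessor pattern. The `DEMO_*` macro and function names are hypothetical and are not taken from the patch; they only mirror the shape of the existing ARM macros `USE_ARMV8_CRC32C` and `USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK`.

```c
/*
 * Hypothetical macro names, for illustration only.  Exactly one of the
 * three branches is compiled, depending on what the build system defined.
 */
#if defined(DEMO_USE_S390X_CRC32C)
/* Compiler flags already guarantee the vector facility: call it directly. */
#define DEMO_CRC_STRATEGY "vector code, no runtime check"
#elif defined(DEMO_USE_S390X_CRC32C_WITH_RUNTIME_CHECK)
/* Build system added the flags itself: probe the CPU once at run time. */
#define DEMO_CRC_STRATEGY "vector code or sb8, chosen at run time"
#else
/* Vector code could not be built at all. */
#define DEMO_CRC_STRATEGY "sb8 only"
#endif

/* Report which strategy this translation unit was built with. */
const char *
demo_crc_strategy(void)
{
    return DEMO_CRC_STRATEGY;
}
```

Built without either macro defined, `demo_crc_strategy()` reports the sb8-only fallback.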
# Slicing by 8:
The standard sb8 code is still always compiled and will be used in
these cases:
- the vector units need to operate on double word boundaries. If the
input stream is not aligned, we use sb8 up to the next boundary.
- using the vector units for data smaller than 64 bytes would negate the
speed improvement of the algorithm, as register setup and post
processing would eat up all benefits.
- the reduction and folding constants are precalculated for 64-byte
chunks. Adding code for smaller chunks would drastically increase the
complexity.
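The three cases above can be sketched as a head/body/tail split. This is only an illustration under stated assumptions, not the code from the patch: all `demo_*` names are hypothetical, a simple bitwise CRC32C stands in for sb8, and the vector kernel is a placeholder that reuses the scalar code so the sketch runs on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_VX_ALIGN 8     /* double word boundary */
#define DEMO_VX_CHUNK 64    /* constants are precalculated for 64-byte chunks */

/* Bitwise CRC32C (reflected polynomial 0x82F63B78); stand-in for sb8. */
static uint32_t
demo_crc32c_scalar(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Placeholder for the vector kernel: checks its preconditions only. */
static uint32_t
demo_crc32c_vx(uint32_t crc, const unsigned char *p, size_t len)
{
    assert((uintptr_t) p % DEMO_VX_ALIGN == 0);
    assert(len > 0 && len % DEMO_VX_CHUNK == 0);
    return demo_crc32c_scalar(crc, p, len);
}

uint32_t
demo_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    crc = 0xFFFFFFFFu;
    size_t      head;
    size_t      body;

    /* Small inputs: vector setup would cost more than it saves. */
    if (len < DEMO_VX_CHUNK)
        return demo_crc32c_scalar(crc, p, len) ^ 0xFFFFFFFFu;

    /* Head: scalar code up to the next double word boundary. */
    head = (DEMO_VX_ALIGN - (uintptr_t) p % DEMO_VX_ALIGN) % DEMO_VX_ALIGN;
    crc = demo_crc32c_scalar(crc, p, head);
    p += head;
    len -= head;

    /* Body: as many whole 64-byte chunks as remain. */
    body = len - len % DEMO_VX_CHUNK;
    if (body > 0)
        crc = demo_crc32c_vx(crc, p, body);
    p += body;
    len -= body;

    /* Tail: leftover bytes go through scalar code again. */
    return demo_crc32c_scalar(crc, p, len) ^ 0xFFFFFFFFu;
}
```

With this split, the vector routine only ever sees aligned input whose length is a multiple of 64, which is what keeps the set of precalculated constants small.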
# The glue code:
I tried to follow the postgres coding conventions. I ran
`./pg_bsd_indent -i4 -l79 -di12 -nfc1 -nlp -sac ...` as mentioned in
src/tools/pg_bsd_indent/README. For me, however, this did not format
the code according to the postgres coding conventions. Therefore I
formatted everything by hand.
I feared that simply writing a function pointer in software that spawns
many threads and forks might cause issues, so I decided to use
`__atomic_store_n` to set the CRC function pointer. I did notice
that the other _choose.c files do not do this. However, I am very
confident that `__atomic_store_n` will always be available on S390X.
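For reviewers, here is a minimal sketch of the idea, not the code from the patch. All `demo_*` names are hypothetical, the two implementations are placeholders that only count calls, the probe is hard-coded, and the relaxed memory order is an assumption. It relies on the GCC/Clang `__atomic_store_n` and `__atomic_load_n` builtins.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*demo_crc_fn) (uint32_t crc, const void *data, size_t len);

static unsigned demo_sb8_calls;
static unsigned demo_vx_calls;

/* Placeholder bodies: they count calls instead of computing a CRC. */
static uint32_t
demo_crc_sb8(uint32_t crc, const void *data, size_t len)
{
    (void) data;
    (void) len;
    demo_sb8_calls++;
    return crc;
}

static uint32_t
demo_crc_vx(uint32_t crc, const void *data, size_t len)
{
    (void) data;
    (void) len;
    demo_vx_calls++;
    return crc;
}

/* Hypothetical probe; a real one would ask the OS about the vector units. */
static int
demo_have_vx(void)
{
    return 0;
}

static uint32_t demo_crc_choose(uint32_t crc, const void *data, size_t len);

/* Starts out pointing at the chooser, which replaces itself on first use. */
static demo_crc_fn demo_crc_impl = demo_crc_choose;

static uint32_t
demo_crc_choose(uint32_t crc, const void *data, size_t len)
{
    demo_crc_fn chosen = demo_have_vx() ? demo_crc_vx : demo_crc_sb8;

    /* Publish the pointer in one indivisible store. */
    __atomic_store_n(&demo_crc_impl, chosen, __ATOMIC_RELAXED);
    return chosen(crc, data, len);
}

uint32_t
demo_crc(uint32_t crc, const void *data, size_t len)
{
    demo_crc_fn fn = __atomic_load_n(&demo_crc_impl, __ATOMIC_RELAXED);

    return fn(crc, data, len);
}
```

If two callers race on the first call, both run the chooser and store the same value, so the worst case is one redundant probe.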
As this is the first time I am writing m4/autotools, I'd kindly ask the
reviewer for special care there :) . There may be dragons. But I have
high hopes all is OK.
Cheers and thanks to all for their work,
--
Eduard Stefes <eduard(dot)stefes(at)ibm(dot)com>
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Added-crc32c-extension-for-ibm-s390x-based-on-VX-.patch | text/x-patch | 37.2 KB |