From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | "Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com> |
Subject: | Re: [PATCH] Hex-coding optimizations using SVE on ARM. |
Date: | 2025-09-25 18:50:20 |
Message-ID: | aNWO7L43UevRErw_@nathan |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> + if (unlikely(!success))
> + i = 0;
>
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.
Removed. I was thinking about smaller inputs when I added this, but it
probably makes little difference.
>> Yeah, the compiler refuses unless the value is an integer literal. I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
>
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.
Done.
> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )
I think so. I doubt there's any performance advantage, but it could be
nice for code cleanup. (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.") I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.
> + * back together to form the final hex-encoded string. It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
>
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.
Removed.
> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types. This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> + return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> + return (Vector8) vector32_broadcast(c);
> +#endif
> +}
>
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes. Setting arbitrary bytes is
> what we're really doing, and would be more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above). On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here. Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.
I think I disagree on this one. We're not broadcasting arbitrary bytes for
every vector element, we're broadcasting a patten of bytes that happens to
be wider than the element size. I would expect this to be a relatively
common use-case. Furthermore, the "set" API is closely tethered to the
vector size, which is fine for SSE2/Neon but may not work down the road
(not to mention the USE_NO_SIMD path). Also, the bitwise NOT approach
won't work because we need to use 0x0f000f00 and 0x000f000f to avoid
angering the assertion in vector8_pack_16(), as mentioned below.
> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements. NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
>
> v10 asserted this requirement -- that still seems like a good thing?
I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work around.
--
nathan
Attachment | Content-Type | Size |
---|---|---|
v12-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch | text/plain | 16.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Пополитов Владлен | 2025-09-25 18:54:34 | Re: Avoiding roundoff error in pg_sleep() |
Previous Message | Tom Lane | 2025-09-25 18:42:32 | Avoiding roundoff error in pg_sleep() |