Quick Links

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

From:	Nathan Bossart <nathandbossart(at)gmail(dot)com>
To:	John Naylor <johncnaylorls(at)gmail(dot)com>
Cc:	"Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com" <Chiranmoy(dot)Bhattacharya(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "Devanga(dot)Susmitha(at)fujitsu(dot)com" <Devanga(dot)Susmitha(at)fujitsu(dot)com>, "Ragesh(dot)Hajela(at)fujitsu(dot)com" <Ragesh(dot)Hajela(at)fujitsu(dot)com>
Subject:	Re: [PATCH] Hex-coding optimizations using SVE on ARM.
Date:	2025-09-25 18:50:20
Message-ID:	aNWO7L43UevRErw_@nathan
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:
> + if (unlikely(!success))
> + i = 0;
>
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.

Removed. I was thinking about smaller inputs when I added this, but it
probably makes little difference.

>> Yeah, the compiler refuses unless the value is an integer literal. I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
>
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.

Done.

> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )

I think so. I doubt there's any performance advantage, but it could be
nice for code cleanup. (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.") I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.

> + * back together to form the final hex-encoded string. It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
>
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.

Removed.

> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types. This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> + return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> + return (Vector8) vector32_broadcast(c);
> +#endif
> +}
>
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes. Setting arbitrary bytes is
> what we're really doing, and would be more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above). On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here. Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.

I think I disagree on this one. We're not broadcasting arbitrary bytes for
every vector element, we're broadcasting a patten of bytes that happens to
be wider than the element size. I would expect this to be a relatively
common use-case. Furthermore, the "set" API is closely tethered to the
vector size, which is fine for SSE2/Neon but may not work down the road
(not to mention the USE_NO_SIMD path). Also, the bitwise NOT approach
won't work because we need to use 0x0f000f00 and 0x000f000f to avoid
angering the assertion in vector8_pack_16(), as mentioned below.

> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements. NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
>
> v10 asserted this requirement -- that still seems like a good thing?

I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work around.

--
nathan

Attachment	Content-Type	Size
v12-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch	text/plain	16.6 KB

In response to

Re: [PATCH] Hex-coding optimizations using SVE on ARM. at 2025-09-25 14:16:35 from John Naylor

Responses

Re: [PATCH] Hex-coding optimizations using SVE on ARM. at 2025-09-29 08:45:27 from John Naylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Пополитов Владлен	2025-09-25 18:54:34	Re: Avoiding roundoff error in pg_sleep()
Previous Message	Tom Lane	2025-09-25 18:42:32	Avoiding roundoff error in pg_sleep()