Re: [POC] verifying UTF-8 using SIMD instructions

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [POC] verifying UTF-8 using SIMD instructions
Date:	2021-07-22 11:38:50
Message-ID:	CAFBsxsFtTbnSehSVDBfy0dNLe+_TBhnvhyDt8_AfPct_XkTT7g@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte
SIMD can operate on 32-byte chunks with a bit of repetition:

- __m128i input;
+ __m128i input1;
+ __m128i input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

while (len >= SIMD_STRIDE_LENGTH)
{
- input = vload(s);
+ input1 = vload(s);
+ input2 = vload(s + sizeof(input1));

- check_for_zeros(input, &error);
+ check_for_zeros(input1, &error);
+ check_for_zeros(input2, &error);

/*
* If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
* sequences at the end. We only update prev_incomplete if the chunk
* contains non-ASCII, since the error is cumulative.
*/
- if (is_highbit_set(input))
+ if (is_highbit_set(bitwise_or(input1, input2)))
{
- check_utf8_bytes(prev, input, &error);
- prev_incomplete = is_incomplete(input);
+ check_utf8_bytes(prev, input1, &error);
+ check_utf8_bytes(input1, input2, &error);
+ prev_incomplete = is_incomplete(input2);
}
else
{
error = bitwise_or(error, prev_incomplete);
}

- prev = input;
+ prev = input2;
s += SIMD_STRIDE_LENGTH;
len -= SIMD_STRIDE_LENGTH;
}

So with a few #ifdefs, we can accommodate two sizes if we like.
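
For readers who want to try the two-loads-per-stride idea outside the patch, here is a minimal standalone sketch of just the ASCII fast-path test. It is not code from the patch: the names STRIDE, stride_has_highbit() and ascii_prefix_in_strides() are hypothetical, and mapping the patch's vload(), bitwise_or() and is_highbit_set() helpers onto _mm_loadu_si128(), _mm_or_si128() and _mm_movemask_epi8() is an assumption. A scalar branch is included so the file also builds where SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define STRIDE 32		/* hypothetical stand-in for SIMD_STRIDE_LENGTH */

/* Returns 1 if any of the 32 bytes at s has its high bit set, else 0. */
static int
stride_has_highbit(const unsigned char *s)
{
#if defined(__SSE2__)
	/* Two unaligned 16-byte loads cover one 32-byte stride. */
	__m128i		input1 = _mm_loadu_si128((const __m128i *) s);
	__m128i		input2 = _mm_loadu_si128((const __m128i *) (s + sizeof(input1)));

	/* OR the halves first so one movemask tests all 32 bytes. */
	return _mm_movemask_epi8(_mm_or_si128(input1, input2)) != 0;
#else
	/* Portable fallback with the same result, one 64-bit word at a time. */
	uint64_t	acc = 0;

	for (size_t i = 0; i < STRIDE; i += sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, s + i, sizeof(word));
		acc |= word;
	}
	return (acc & UINT64_C(0x8080808080808080)) != 0;
#endif
}

/* Counts the leading bytes of s that lie in all-ASCII 32-byte strides. */
static size_t
ascii_prefix_in_strides(const unsigned char *s, size_t len)
{
	size_t		done = 0;

	assert(s != NULL || len == 0);
	while (len - done >= STRIDE && !stride_has_highbit(s + done))
		done += STRIDE;
	return done;
}
```

The sketch only covers the "is this stride all ASCII" decision; the full validation and the handling of the tail shorter than one stride are left out.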

For another, the prevN() functions would need to change, at least on x86 --
that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8()
plus _mm256_permute2x128_si256(). Also, we might have to do something with
the vector typedef.
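
To illustrate why the extra permute is needed: _mm256_alignr_epi8() shifts each 128-bit lane independently, so the low lane has to be fed the high lane of the previous vector by hand. The following is a scalar model of that combination on plain byte arrays, not intrinsic code and not taken from the patch. The function names are hypothetical, and the 0x21 selector and the 16 - n shift count are my reading of the intrinsics' documented semantics rather than anything tested on AVX2 hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANE 16			/* bytes per 128-bit lane */

/*
 * Scalar model of _mm_alignr_epi8(hi, lo, n) on one 16-byte lane:
 * concatenate hi:lo (lo in the low bytes), shift right by n bytes,
 * keep the low 16 bytes.
 */
static void
model_alignr_lane(uint8_t *out, const uint8_t *hi, const uint8_t *lo, int n)
{
	uint8_t		cat[2 * LANE];

	assert(n >= 0 && n <= LANE);
	memcpy(cat, lo, LANE);
	memcpy(cat + LANE, hi, LANE);
	memcpy(out, cat + n, LANE);
}

/*
 * Model of a 32-byte prevN(): out[i] is the byte n positions before
 * input[i], with the first n bytes taken from the tail of prev.
 * out must not overlap prev or input.
 */
static void
model_prev_256(uint8_t *out, const uint8_t *prev, const uint8_t *input, int n)
{
	uint8_t		shifted[2 * LANE];

	/*
	 * Model of _mm256_permute2x128_si256(prev, input, 0x21):
	 * low lane = high lane of prev, high lane = low lane of input.
	 */
	memcpy(shifted, prev + LANE, LANE);
	memcpy(shifted + LANE, input, LANE);

	/*
	 * Model of _mm256_alignr_epi8(input, shifted, 16 - n):
	 * one independent alignr per 128-bit lane.
	 */
	model_alignr_lane(out, input, shifted, LANE - n);
	model_alignr_lane(out + LANE, input + LANE, shifted + LANE, LANE - n);
}

/* Straightforward reference: shift the 64-byte sequence prev:input by n. */
static void
reference_prev(uint8_t *out, const uint8_t *prev, const uint8_t *input, int n)
{
	for (int i = 0; i < 2 * LANE; i++)
		out[i] = (i < n) ? prev[2 * LANE - n + i] : input[i - n];
}
```

Comparing model_prev_256() against reference_prev() for n = 1, 2, 3 is a quick way to convince yourself that the permute supplies exactly the bytes that cross the lane boundary.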

That said, I think we can punt on that until we have an application that's
much more compute-intensive. As it is with SSE4, COPY FROM WHERE <selective
predicate> already pushes the utf8 validation way down in profiles.

Nice! If it passes regression tests, it *should* be fine, but stress
testing would be welcome on any platform.

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on. I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't mind
having some type of server-class hardware that relies on the recent nifty
DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Re: [POC] verifying UTF-8 using SIMD instructions at 2021-07-22 00:07:26 from Thomas Munro
