Re: [POC] verifying UTF-8 using SIMD instructions

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-16 01:32:52
Message-ID: CAFBsxsEinN2YLsGj_skeGu+o3Lht6DMS651z2J2UXO6iARyrtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>

Attached is the first attempt at using SSE4 to do the validation, but first
I'll answer your questions about the fallback.

I should mention that v2 had a correctness bug for 4-byte characters that I
found when I was writing regression tests. It shouldn't materially affect
performance, however.

> I thought the "chinese" numbers above are pure multibyte input, and it
> seems to do well on that. Where does it regress? In multibyte encodings
> other than UTF-8?

Yes, the second set of measurements was intended to represent multibyte
encodings other than UTF-8. But instead of using one of those encodings, I
simulated non-UTF-8 by copying the pattern used for those: in the loop,
check for ascii then either advance or verify one character. It was a quick
way to use the same test.

> How bad is the regression?

I'll copy the measurements here together with master so it's easier to
compare:

~= PG13 (revert 6c5576075b0f9 and b80e10638e3):

chinese | mixed | ascii
---------+-------+-------
1565 | 848 | 365

master:

chinese | mixed | ascii
---------+-------+-------
1081 | 761 | 366

ascii fast-path plus pg_*_verifychar():

chinese | mixed | ascii
---------+-------+-------
1279 | 656 | 94

As I mentioned upthread, pure multibyte is still faster than PG13. Reducing
the ascii check to 8-bytes at time might alleviate the regression.

> I tested this on my first generation Raspberry Pi (chipmunk). I had to
> tweak it a bit to make it compile, since the SSE autodetection code was
> not finished yet. And I used generate_series(1, 1000) instead of
> generate_series(1, 10000) in the test script (mbverifystr-speed.sql)
> because this system is so slow.
>
> master:
>
> mixed | ascii
> -------+-------
> 1310 | 1041
> (1 row)
>
> v2-add-portability-stub-and-new-fallback.patch:
>
> mixed | ascii
> -------+-------
> 2979 | 910
> (1 row)
>
> I'm guessing that's because the unaligned access in check_ascii() is
> expensive on this platform.

Hmm, I used memcpy() as suggested. Is that still slow on that platform?
That's 32-bit, right? Some possible remedies:

1) For the COPY FROM case, we should align the allocation on a cacheline --
we already have examples of that idiom elsewhere. I was actually going to
suggest doing this anyway, since unaligned SIMD loads are often slower, too.

2) As the simdjson fallback was based on Fuchsia (the Lemire paper implies
it was tested carefully on Arm and I have no reason to doubt that), I could
try to follow that example more faithfully by computing the actual
codepoints. It's more computation and just as many branches as far as I can
tell, but it's not a lot of work. I can add that alternative fallback to
the patch set. I have no Arm machines, but I can test on a POWER8 machine.

3) #ifdef out the ascii check for 32-bit platforms.

4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time.
I'll probably try this first.

Now, I'm pleased to report that I got SSE4 working, and it seems to work.
It still needs some stress testing to find any corner case bugs, but it
shouldn't be too early to share some numbers on Clang 10 / MacOS:

master:

chinese | mixed | ascii
---------+-------+-------
1082 | 751 | 364

v3 with SSE4.1:

chinese | mixed | ascii
---------+-------+-------
127 | 128 | 126

Some caveats and notes:

- It takes almost no recognizable code from simdjson, but it does take the
magic constants lookup tables almost verbatim. The main body of the code
has no intrinsics at all (I think). They're all hidden inside static inline
helper functions. I reused some cryptic variable names from simdjson. It's
a bit messy but not terrible.

- It diffs against the noError conversion patch and adds additional tests.

- It's not smart enough to stop at the last valid character boundary --
it's either all-valid or it must start over with the fallback. That will
have to change in order to work with the proposed noError conversions. It
shouldn't be very hard, but needs thought as to the clearest and safest way
to code it.

- There is no ascii fast-path yet. With this algorithm we have to be a bit
more careful since a valid ascii chunk could be preceded by an incomplete
sequence at the end of the previous chunk. Not too hard, just a bit more
work.

- This is my first time hacking autoconf, and it still seems slightly
broken, yet functional on my machine at least.

- It only needs SSE4.1, but I didn't want to create a whole new CFLAGS, so
it just reuses SSE4.2 for the runtime check and the macro names. Also, it
doesn't test for SSE2, it just insists on 64-bit for the runtime check. I
imagine it would refuse to build on 32-bit machines if you passed it -msse42

- There is a placeholder for Windows support, but it's not developed.

- I had to add a large number of casts to get rid of warnings in the magic
constants macros. That needs some polish.

I also attached a C file that visually demonstrates every step of the
algorithm following the example found in Table 9 in the paper. That
contains the skeleton coding I started with and got abandoned early, so it
might differ from the actual patch.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v3-SSE4-with-autoconf-support.patch application/octet-stream 44.1 KB
test-utf8.c application/octet-stream 8.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2021-02-16 01:38:56 Re: PATCH: Batch/pipelining support for libpq
Previous Message Yugo NAGATA 2021-02-16 01:31:55 Re: Implementing Incremental View Maintenance