Re: [POC] verifying UTF-8 using SIMD instructions

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-04 21:48:35
Message-ID: CAFBsxsGKo53PC06KKBkOeD4eEA=hsG-pqu7HQpKZeNBx3ubibw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
> > simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.

I gave that a shot, and it's actually pretty good. According to this paper
[1], a 16-byte stride was best, and it also makes for an apples-to-apples
comparison with 16-byte SSE registers, so I tried both 16- and 8-byte
strides.
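
For reference, here's a minimal sketch of the 16-byte C check (names are
mine, not from the patch); the 8-byte stride is the same idea with a single
word:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Check whether the next 16 bytes are all ASCII by OR-ing two 8-byte
 * words and testing the high bit of every byte.  memcpy() avoids
 * alignment problems and compiles down to plain loads.
 */
static bool
chunk_is_ascii_16(const unsigned char *s)
{
    uint64_t    half1;
    uint64_t    half2;

    memcpy(&half1, s, sizeof(half1));
    memcpy(&half2, s + 8, sizeof(half2));

    return ((half1 | half2) & UINT64_C(0x8080808080808080)) == 0;
}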

> All supported encodings are ASCII subsets. Might be best to put the
> ASCII-check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.

I tried the static inline function, and also putting the whole optimized
UTF-8 loop in a separate function to which the caller passes a pointer to
the appropriate pg_*_verifychar().
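
For concreteness, the function-pointer flavor might look roughly like this
(hypothetical names; just a sketch of the shape, not the actual patch):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 16-byte all-ASCII test, as in the earlier sketch */
static inline bool
chunk_is_ascii_16(const unsigned char *s)
{
    uint64_t    half1;
    uint64_t    half2;

    memcpy(&half1, s, sizeof(half1));
    memcpy(&half2, s + 8, sizeof(half2));
    return ((half1 | half2) & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * Generic verify loop: skip whole 16-byte ASCII-only chunks, and fall
 * back to the encoding's per-character verifier otherwise.  Returns the
 * number of bytes verified, in the style of pg_*_verifystr().
 */
static int
verify_str_fast(const unsigned char *s, int len,
                int (*verifychar) (const unsigned char *, int))
{
    const unsigned char *start = s;

    while (len > 0)
    {
        if (len >= 16 && chunk_is_ascii_16(s))
        {
            /* fast path: the whole chunk is ASCII, skip it */
            s += 16;
            len -= 16;
            continue;
        }

        /* slow path: verify one character with the encoding's rules */
        {
            int         l = verifychar(s, len);

            if (l <= 0)         /* invalid byte sequence */
                break;
            s += l;
            len -= l;
        }
    }

    return s - start;
}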

In the table below, "inline" refers to coding directly inside
pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef;
I didn't bother splitting them out, because for other encodings we want one
of the other approaches above. For those, "C retail" refers to a static
inline function containing the contents of the inner loop, if I understood
your suggestion correctly. This needs more boilerplate in each function, so
I don't prefer it. "C func pointer" refers to the pointer approach I just
mentioned. That is the cleanest-looking way to generalize it, so it's the
only version I tested with different strides -- 8 and 16 bytes.

This is the same test I used earlier, i.e. the test in [2], but with an
almost purely multibyte Chinese text of about the same size added.

x86-64 Linux gcc 8.4.0:

 build            | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    1480 |   848 |   428
 inline SSE       |    1617 |   634 |    63
 inline C         |    1481 |   843 |    50
 C retail         |    1493 |   838 |    49
 C func pointer   |    1467 |   851 |    49
 C func pointer 8 |    1518 |   757 |    56

x86-64 macOS clang 10.0.0:

 build            | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    1086 |   760 |   374
 inline SSE       |    1081 |   529 |    70
 inline C         |    1093 |   649 |    49
 C retail         |    1132 |   695 |   152
 C func pointer   |    1085 |   609 |    59
 C func pointer 8 |    1099 |   571 |    71

PowerPC-LE Linux gcc 4.8.5:

 build            | chinese | mixed | ascii
------------------+---------+-------+-------
 master           |    2961 |  1525 |   871
 inline SSE       |   (n/a) | (n/a) | (n/a)
 inline C         |    2911 |  1329 |    80
 C retail         |    2838 |  1311 |   102
 C func pointer   |    2828 |  1314 |    80
 C func pointer 8 |    3143 |  1249 |   133

Looking at the results, the main advantage of SSE here is that it's more
robust for mixed inputs: if a 16-byte chunk is not ASCII-only but has a run
of ASCII at the front, we can still skip that prefix with a single CPU
instruction, whereas in C we have to verify the whole chunk using the slow
path.
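
For illustration, a hypothetical SSE2 fragment (the bit scan uses a
GCC/Clang builtin): _mm_movemask_epi8() gathers the high bit of each byte
into a scalar mask, so a zero mask means the whole chunk is ASCII, and
otherwise the number of trailing zero bits is exactly the length of the
ASCII prefix we can skip before dropping to the slow path.

#include <stdint.h>
#include <emmintrin.h>          /* SSE2 intrinsics */

/*
 * Return how many leading bytes of this 16-byte chunk are ASCII.
 */
static inline int
ascii_prefix_len_sse2(const unsigned char *s)
{
    __m128i     chunk = _mm_loadu_si128((const __m128i *) s);
    uint32_t    mask = (uint32_t) _mm_movemask_epi8(chunk);

    if (mask == 0)
        return 16;              /* whole chunk is ASCII */
    return __builtin_ctz(mask); /* bytes before the first high bit set */
}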

The "C func pointer approach" seems to win out over the "C retail" approach
(static inline function).

Using an 8-byte stride is slightly better for mixed inputs on all platforms
tested, but regresses on pure ASCII and also seems to regress on pure
multibyte. The difference in the multibyte case is small enough that it
could be random, but it happens on two platforms, so I'd say it's real. On
the other hand, pure multibyte is not as common as mixed text.

Overall, I think the function pointer approach with an 8-byte stride is the
best balance. If that's agreeable, next I plan to test with short inputs,
because I think we'll want a guard if-statement so we only loop through the
fast path when the string is long enough to justify that.
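
Something like this shape, perhaps (the threshold and names are made up,
and the interleaving of fast and slow paths is flattened for brevity):

#include <stdint.h>
#include <string.h>

/* hypothetical cutoff; the real value would come from benchmarking */
#define FAST_PATH_MIN_LEN   32

static int
verify_str_guarded(const unsigned char *s, int len,
                   int (*verifychar) (const unsigned char *, int))
{
    const unsigned char *start = s;

    /*
     * Guard: only loop through the chunked ASCII fast path when the
     * input is long enough to justify its overhead.
     */
    if (len >= FAST_PATH_MIN_LEN)
    {
        while (len >= 8)
        {
            uint64_t    word;

            memcpy(&word, s, sizeof(word));
            if ((word & UINT64_C(0x8080808080808080)) != 0)
                break;
            s += 8;
            len -= 8;
        }
    }

    /* per-character verification for the rest */
    while (len > 0)
    {
        int         l = verifychar(s, len);

        if (l <= 0)
            break;
        s += l;
        len -= l;
    }

    return s - start;
}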

> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.

I changed topics a bit quickly, but here I'm talking about using a
table-driven state machine to verify the multibyte case. It's possible I
did something wrong, since my model implementation decodes, and having to
keep track of how many bytes got verified might be the culprit. I'd like to
try again to speed up multibyte, but that might be a PG15 project.

[1] https://arxiv.org/abs/2010.03090
[2] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
