Re: [POC] verifying UTF-8 using SIMD instructions

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-01 18:01:50
Message-ID: bc278857-f70b-716b-558f-6f9c1475fb61@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/02/2021 19:32, John Naylor wrote:
> It makes sense to start with the ascii subset of UTF-8 for a couple
> reasons. First, ascii is very widespread in database content,
> particularly in bulk loads. Second, ascii can be validated using the
> simple SSE2 intrinsics that come with (I believe) any x64-64 chip, and
> I'm guessing we can detect that at compile time and not mess with
> runtime checks. The examples above using SSE for the general case are
> much more complicated and involve SSE 4.2 or AVX.

I wonder how using SSE compares with dealing with 64 or 32-bit words at
a time, using regular instructions? That would be more portable.

> Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is
> okay, I'll do Linux/gcc and add more inputs). The test is the same as
> Heikki shared in [3], but I added a case with >95% Chinese characters
> just to show how that compares to the mixed ascii/multibyte case.
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     1081 |   761 |   366
>
> patch:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     1103 |   498 |    51
>
> The speedup in the pure ascii case is nice.

Yep.

> In the attached POC, I just have a pro forma portability stub, and left
> full portability detection for later. The fast path is inlined inside
> pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted
> into a separate function to which is passed a function pointer for full
> encoding validation. That would allow other encodings with strict ascii
> subsets to use this as well, but coding that abstraction might be a
> little messy, and b80e10638e3 already gives a performance boost over PG13.

All supported encodings are ASCII subsets. Might be best to putt the
ASCII-check into a static inline function and use it in all the verify
functions. I presume it's only a few instructions, and these functions
can be pretty performance sensitive.

> I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> far that has made performance worse. If I ever have more success with
> that, I'll add that in the mix.

That's disappointing. Perhaps the SIMD algorithms have higher startup
costs, so that you need longer inputs to benefit? In that case, it might
make sense to check the length of the input and only use the SIMD
algorithm if the input is long enough.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2021-02-01 18:02:16 Re: SEARCH and CYCLE clauses
Previous Message Magnus Hagander 2021-02-01 17:44:18 Re: Proposal: Save user's original authenticated identity for logging