[POC] verifying UTF-8 using SIMD instructions

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-01 17:32:23
Message-ID: CAFBsxsEV_SzH+OLyCiyon=iwggSyMh_eF6A3LU2tiWf3Cy2ZQg@mail.gmail.com
Lists: pgsql-hackers

Hi,

As of b80e10638e3, there is a new API for validating the encoding of
strings, and one of the side effects is that we have a wider choice of
algorithms. For UTF-8, it has been demonstrated that SIMD is much faster at
decoding [1] and validation [2] than the standard approach we use.

It makes sense to start with the ascii subset of UTF-8 for a couple of
reasons. First, ascii is very widespread in database content, particularly
in bulk loads. Second, ascii can be validated using the simple SSE2
intrinsics that come with (I believe) any x86-64 chip, and I'm guessing we
can detect that at compile time and not mess with runtime checks. The
examples above using SSE for the general case are much more complicated and
involve SSE 4.2 or AVX.
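
To make the SSE2 idea concrete, here is a minimal sketch and not the code in
the attached patch. The helper name ascii_prefix_len() is hypothetical, and
treating zero bytes as "not ascii" is an assumption based on the verifiers
rejecting embedded NULs. It checks 16 bytes at a time when __SSE2__ is
defined at compile time and falls back to a byte loop otherwise:

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not from the patch): returns the number of leading
 * bytes of s that are nonzero 7-bit ascii.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	assert(s != NULL || len == 0);

#if defined(__SSE2__)
	{
		const __m128i zero = _mm_setzero_si128();

		while (len - i >= 16)
		{
			/* unaligned load of the next 16 input bytes */
			__m128i		chunk = _mm_loadu_si128((const __m128i *) (s + i));

			/* movemask gathers the high bit of each of the 16 bytes */
			int			highbits = _mm_movemask_epi8(chunk);

			/* assumption: zero bytes are also treated as invalid */
			int			zeros = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));

			if (highbits | zeros)
				break;			/* let the byte loop find the exact spot */
			i += 16;
		}
	}
#endif

	/* scalar tail, also the whole job when SSE2 is not available */
	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;

	return i;
}
```

A caller would hand whatever follows the returned prefix to the regular
multibyte verifier.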

Here are some numbers on my laptop (macOS/clang 10 -- if the concept is
okay, I'll do Linux/gcc and add more inputs). The test is the same as
Heikki shared in [3], but I added a case with >95% Chinese characters just
to show how that compares to the mixed ascii/multibyte case.

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

patch:

 chinese | mixed | ascii
---------+-------+-------
    1103 |   498 |    51

The speedup in the pure ascii case is nice.

In the attached POC, I just have a pro forma portability stub, and left
full portability detection for later. The fast path is inlined inside
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted into
a separate function that takes a function pointer for full encoding
validation. That would allow other encodings with strict ascii subsets to
use this as well, but coding that abstraction might be a little messy, and
b80e10638e3 already gives a performance boost over PG13.
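
For illustration only, the abstraction could look roughly like the sketch
below. None of these names exist in the tree: char_verifier,
verifystr_ascii_fast_path() and toy_two_byte_verifier() are hypothetical, and
the toy verifier is a deliberately incomplete stand-in for a real per-encoding
function. The sketch uses a portable 8-byte word check rather than SSE2 so
that it builds anywhere:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: byte length of the next character, or -1 if invalid */
typedef int (*char_verifier) (const unsigned char *s, int len);

/* True if all 8 bytes at s are nonzero 7-bit ascii */
static int
is_ascii_chunk8(const unsigned char *s)
{
	const uint64_t highmask = UINT64_C(0x8080808080808080);
	uint64_t	w;

	memcpy(&w, s, sizeof(w));
	if (w & highmask)
		return 0;				/* some byte has its high bit set */

	/*
	 * Every byte is now <= 0x7F, so adding 0x7F to each cannot carry into
	 * the next byte, and it sets the high bit of exactly the nonzero bytes.
	 */
	w += UINT64_C(0x7F7F7F7F7F7F7F7F);
	return (w & highmask) == highmask;
}

/* Returns the number of leading bytes that verified successfully */
static int
verifystr_ascii_fast_path(const unsigned char *s, int len,
						  char_verifier verify_char)
{
	const unsigned char *start = s;

	assert(verify_char != NULL);

	while (len > 0)
	{
		int			l;

		/* fast path: skip a whole chunk of plain ascii */
		if (len >= 8 && is_ascii_chunk8(s))
		{
			s += 8;
			len -= 8;
			continue;
		}

		if (*s != 0 && *s < 0x80)
			l = 1;
		else
		{
			/* slow path: defer to the encoding-specific verifier */
			l = verify_char(s, len);
			if (l < 0)
				break;
		}
		s += l;
		len -= l;
	}
	return (int) (s - start);
}

/* Toy stand-in: accepts only two-byte UTF-8 sequences, rejects the rest */
static int
toy_two_byte_verifier(const unsigned char *s, int len)
{
	if (len >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF &&
		s[1] >= 0x80 && s[1] <= 0xBF)
		return 2;
	return -1;
}
```

Any encoding whose single-byte range is strictly ascii could pass its own
verifier here. The cost is an indirect call on every non-ascii character,
which is part of why I'm not sure the abstraction is worth it.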

I also gave a shot at doing full UTF-8 recognition using a DFA, but so far
that has made performance worse. If I ever have more success with that,
I'll add that to the mix.

[1] https://woboq.com/blog/utf-8-processing-using-simd.html
[2] https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[3] https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment: v1-verify-utf8-sse-ascii.patch (application/x-patch, 2.3 KB)
