Re: [POC] verifying UTF-8 using SIMD instructions

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-09 21:12:22
Message-ID: CAFBsxsFU7C5cHCLfERcf+nNTvCJcW-hBboJP4shwKVvm-qegbA@mail.gmail.com
Lists: pgsql-hackers

I wrote:
>
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> One of his earlier demos [1] (in simdutf8check.h) had a version that used
> mostly SSE2 with just three intrinsics from SSSE3. That's widely available
> by now. He measured that at 0.7 cycles per byte, which is still good
> compared to AVX2 0.45 cycles per byte [2].
>
> Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would
> assume that if that check (and the corresponding runtime check) passes, we
> can assume SSE2. That code has three licenses to choose from -- Apache 2,
> Boost, and MIT. Something like that might be straightforward to start from.
> I think the only obstacles to worry about are license and getting it to fit
> into our codebase. Adding more than zero high-level comments with a good
> description of how it works in detail is also a bit of a challenge.

I double-checked, and it's actually two SSSE3 intrinsics and one SSE4.1
intrinsic, but the SSE4.1 one can be emulated with a few SSE2 intrinsics.
Alternatively, we could probably fold all three into the SSE4.2 CRC check
and have a single symbol to save on boilerplate.
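
To illustrate the kind of emulation meant here, below is a minimal sketch. It
assumes the SSE4.1 intrinsic in question is _mm_testz_si128() used as an
"is the error vector all zero?" test; the message above does not name the
intrinsic, so that is an assumption, and the function name
vector_is_zero_sse2() is hypothetical.

```c
#include <emmintrin.h>			/* SSE2 intrinsics only */
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if every bit of v is zero, i.e. the
 * same answer as _mm_testz_si128(v, v) from SSE4.1, using only SSE2.
 */
static inline bool
vector_is_zero_sse2(__m128i v)
{
	/* Each byte lane becomes 0xFF where v's byte is zero, else 0x00. */
	__m128i		eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());

	/* Gather the top bit of all 16 lanes; 0xFFFF means all were zero. */
	return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

This costs one compare and one movemask instead of a single PTEST, which
should matter little if the check is only done once at the end of the input.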

I hacked that demo [1] into wchar.c (very ugly patch attached), and got the
following:

master:

 mixed | ascii
-------+-------
   757 |   366

Lemire demo:

 mixed | ascii
-------+-------
   172 |   168

This one lacks an ASCII fast path, but the AVX2 version in the same file
has one that could probably be adapted easily. With that, I think this
would be worth adapting to our codebase and license. Thoughts?
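
For readers unfamiliar with the idea, an ASCII fast path just checks whether
any byte has its high bit set before doing the full multibyte validation.
The following is a portable scalar sketch of that idea, not the AVX2 code
from the demo; the function name is_all_ascii() is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if s[0..len-1] contains only bytes
 * below 0x80. Bytes are OR-ed together eight at a time, so a single test
 * of the high bits at the end covers the whole input.
 */
static bool
is_all_ascii(const unsigned char *s, size_t len)
{
	uint64_t	acc = 0;

	while (len >= sizeof(uint64_t))
	{
		uint64_t	chunk;

		/* memcpy avoids unaligned-access and aliasing problems. */
		memcpy(&chunk, s, sizeof(chunk));
		acc |= chunk;
		s += sizeof(chunk);
		len -= sizeof(chunk);
	}

	/* Handle the remaining 0..7 bytes one at a time. */
	while (len > 0)
	{
		acc |= *s++;
		len--;
	}

	/* Any set high bit means at least one non-ASCII byte was seen. */
	return (acc & UINT64_C(0x8080808080808080)) == 0;
}
```

A caller would try this first and fall back to the full validator only when
it returns false. A SIMD version does the same thing with wider chunks.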

The advantage of this demo is that it's not buried in a mountain of modern
C++.

Simdjson can use AVX -- do you happen to know which target it got compiled
to? AVX vectors are 256 bits wide and that requires OS support. The OSes we
care most about were updated 8-12 years ago, but that would still be
something to check, in addition to more configure checks.
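
As a rough sketch of what such a runtime check could look like: the CPU must
report AVX and OSXSAVE via CPUID, and XCR0 must show that the OS saves both
the XMM and YMM register state. This assumes GCC or Clang on x86 (for
<cpuid.h> and inline assembly); the function name cpu_and_os_support_avx()
is hypothetical.

```c
#include <cpuid.h>				/* GCC/Clang helper for the CPUID instruction */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true only if the CPU advertises AVX and the
 * OS has enabled saving of the 256-bit YMM registers.
 */
static bool
cpu_and_os_support_avx(void)
{
	unsigned int eax, ebx, ecx, edx;
	uint32_t	xcr0_lo, xcr0_hi;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;

	/* CPUID.1:ECX bit 27 = OSXSAVE, bit 28 = AVX. */
	if ((ecx & (1u << 27)) == 0 || (ecx & (1u << 28)) == 0)
		return false;

	/* XGETBV with ECX = 0 reads XCR0; safe only once OSXSAVE is set. */
	__asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));

	/* XCR0 bit 1 = SSE (XMM) state, bit 2 = AVX (YMM) state. */
	return (xcr0_lo & 0x6) == 0x6;
}
```

An AVX2 code path would additionally need to test the AVX2 feature bit
(CPUID leaf 7), which this sketch leaves out.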

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment: utf-sse42-demo.patch (application/octet-stream, 6.3 KB)
