Re: [POC] verifying UTF-8 using SIMD instructions

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-08 10:17:11
Lists: pgsql-hackers

On 07/02/2021 22:24, John Naylor wrote:
> Here is a more polished version of the function pointer approach, now
> adapted to all multibyte encodings. Using the not-yet-committed tests
> from [1], I found a thinko bug that resulted in the test for nul bytes
> to not only be wrong, but probably also elided by the compiler. Doing it
> correctly is noticeably slower on pure ascii, but still several times
> faster than before, so the conclusions haven't changed any. I'll run
> full measurements later this week, but I'll share the patch now for review.

As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's
algorithm from the simdjson library [1], see attached patch. I
microbenchmarked it using the same test I used before [2].

These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1".

unpatched master:

postgres=# \i mbverifystr-speed.sql
 mixed | ascii
-------+-------
   728 |   393
(1 row)


 mixed | ascii
-------+-------
   759 |    98
(1 row)


 mixed | ascii
-------+-------
    53 |    31
(1 row)

So clearly that algorithm is fast. Not sure if it has a high startup
cost, or large code size, or other tradeoffs that we don't want. At
least it depends on SIMD instructions, so it requires more code for the
architecture-specific implementations and autoconf logic and all that.
Nevertheless, I think it deserves a closer look. I'm a bit reluctant to
put in half-way measures when there's a clearly superior algorithm out
there.
I also tested the fallback implementation from the simdjson library
(included in the patch, if you uncomment it in simdjson-glue.c):

 mixed | ascii
-------+-------
   447 |    46
(1 row)

I think we should at least try to adopt that. At a high level, it looks
pretty similar to your patch: you load the data 8 bytes at a time and
check if they are all ASCII. If there are any non-ASCII chars, you check
the bytes one by one; otherwise you load the next 8 bytes. Your patch should
be able to achieve the same performance, if done right. I don't think
the simdjson code forbids \0 bytes, so that will add a few cycles, but


- Heikki

PS. Your patch as it stands isn't safe on systems with strict alignment:
the string passed to the verify function isn't guaranteed to be 8-byte
aligned. Use memcpy to fetch the next 8-byte chunk to fix that.
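
A rough sketch of what the memcpy-based fetch could look like, assuming a
simple loop that only measures the leading run of ASCII bytes (the names
load_chunk and ascii_prefix_len are invented for this example and do not
appear in the patch). Compilers typically turn a fixed-size memcpy like this
into a single load where the hardware allows it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fetch 8 bytes from a possibly unaligned address without a pointer cast. */
static uint64_t
load_chunk(const unsigned char *p)
{
	uint64_t	chunk;

	memcpy(&chunk, p, sizeof(chunk));
	return chunk;
}

/* Return the number of leading bytes in s[0..len) that are ASCII. */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	/* Fast path: skip 8 bytes at a time while no byte has the high bit set. */
	while (len - i >= sizeof(uint64_t))
	{
		if (load_chunk(s + i) & UINT64_C(0x8080808080808080))
			break;
		i += sizeof(uint64_t);
	}

	/* Slow path: finish the tail, or locate the first non-ASCII byte. */
	while (i < len && s[i] < 0x80)
		i++;

	return i;
}
```

In a real verifier the slow path would go on to validate the multibyte
sequence instead of stopping; this sketch stops there to keep the example
short.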

Attachment Content-Type Size
simdjson-utf8-hack.patch text/x-patch 5.5 KB
