Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-20 21:24:33
Message-ID: CAFBsxsHP556FGZZZi=gmYxZFXv0QjPXHWmbp1HdeFPYGVUQ-XQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <
sitnikov(dot)vladimir(at)gmail(dot)com> wrote:

> > An alternative idea: should we optimize for validation of **valid**
inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters
always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid
input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> > if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid,
then just accept it
> > return s + len;
> > }
> > // slow path: the string is not valid, perform a slower analysis
> > return s + ....;
> > }

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
2944 | 1523 | 871 | 1473 | 1509

v14:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
888 | 607 | 179 | 777 | 1328

v19:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
809 | 472 | 223 | 558 | 805

x86 Macbook, clang 12:

HEAD:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
974 | 691 | 370 | 456 | 526

v14:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
674 | 346 | 78 | 309 | 504

v19:
chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
379 | 181 | 94 | 219 | 376

Note that the branchy code's worst case (mixed8) is here the same speed as
multibyte. With Vladimir's idea * , we call check_ascii only every 8 bytes
of input, not every time we verify one multibyte character. Also, we only
have to check the DFA state every time we loop over 8 bytes, not every time
we step through the DFA. That means we have to walk backwards at the end to
find the last leading byte, but the SSE code already knew how to do that,
so I used that logic here in the caller, which will allow some
simplification of how the SSE code returns.

The state check is likely why the ascii case is slightly slower than v14.
We could go back to checking ascii 16 bytes at a time, since there's little
penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy
code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next
month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid
byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed
how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v19-rewrite-pg_verify_str-for-speed.patch application/octet-stream 15.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhihong Yu 2021-07-20 21:33:07 Re: Have I found an interval arithmetic bug?
Previous Message Tomas Vondra 2021-07-20 20:27:09 Re: logical decoding and replication of sequences