Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-15 18:12:43
Message-ID: CAFBsxsEVoO-cGN7Q7H+ytuExSfnm0xm19CMbjs2Q5a+7LXX_rw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 15, 2021 at 1:10 AM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
wrote:

> - check_ascii() seems to be used only for 64-bit chunks. So why not
> remove the len argument and the len <= sizeof(int64) checks inside the
> function. We can rename it to check_ascii64() for clarity.

Thanks for taking a look!

Well yes, but there's nothing so intrinsic to 64 bits that the name needs
to reflect that. Earlier versions worked on 16 bytes at a time. The compiler
will optimize away the len check, but we could replace it with an assert
instead.
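
For illustration, here is a minimal sketch of what the assert variant could
look like. It is not the code from any posted patch version: the name
check_ascii_chunk, the 8-byte chunk width, and the return convention (bytes
consumed, or 0 to fall back to the slow path) are all assumptions, as is
rejecting zero bytes in the fast path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk width; nothing in the logic depends on it being 8. */
#define ASCII_CHUNK_LEN sizeof(uint64_t)

/*
 * Returns ASCII_CHUNK_LEN if the next chunk consists only of nonzero ASCII
 * bytes, or 0 to tell the caller to fall back to the byte-at-a-time path.
 */
static inline size_t
check_ascii_chunk(const unsigned char *s, size_t len)
{
	uint64_t	chunk;

	/* The caller guarantees a full chunk, so assert instead of branching. */
	assert(len >= ASCII_CHUNK_LEN);
	(void) len;					/* avoid an unused warning under NDEBUG */

	/* memcpy avoids unaligned access; compilers turn it into a plain load. */
	memcpy(&chunk, s, sizeof(chunk));

	/* Any byte with the high bit set is non-ASCII. */
	if (chunk & UINT64_C(0x8080808080808080))
		return 0;

	/*
	 * All high bits are clear here, so adding 0x7f to each byte cannot carry
	 * into its neighbor, and it sets the high bit exactly when the byte is
	 * nonzero. (Assumption: zero bytes are left to the slow path.)
	 */
	if (((chunk + UINT64_C(0x7f7f7f7f7f7f7f7f)) & UINT64_C(0x8080808080808080))
		!= UINT64_C(0x8080808080808080))
		return 0;

	return ASCII_CHUNK_LEN;
}
```

Since the only caller would pass a length it has already checked, the assert
documents the contract without costing anything in a production build.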

> - I was thinking, why not have a pg_utf8_verify64() that processes
> 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
> extract a 64-bit chunk from the string. We can use the same chunk to
> extract the required bits from a two byte char or a 4 byte char. This
> way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
> etc.

Loading bytes from L1 is really fast -- I wouldn't even call it
"extraction".

> More importantly, we can avoid the separate continuation-char
> checks for each individual byte.

On a pipelined superscalar CPU, I wouldn't expect it to matter in the
slightest.
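
To make the comparison concrete, the byte-at-a-time style looks roughly like
the sketch below for a three-byte sequence. The function and macro names are
hypothetical and this is not lifted from the patch; the validity rules are
just those of UTF-8 itself. The two continuation tests do not depend on each
other, so the CPU is free to evaluate them in parallel.

```c
#include <stdbool.h>
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx. */
#define IS_CONTINUATION(b) (((b) & 0xC0) == 0x80)

/*
 * Hypothetical helper: returns true if s starts with a valid three-byte
 * UTF-8 sequence (lead byte 1110xxxx followed by two continuation bytes).
 */
static inline bool
utf8_three_byte_ok(const unsigned char *s, size_t len)
{
	unsigned char b1,
				b2,
				b3;

	if (len < 3)
		return false;

	/* Plain byte loads; these will almost always hit L1. */
	b1 = s[0];
	b2 = s[1];
	b3 = s[2];

	if ((b1 & 0xF0) != 0xE0)
		return false;

	/* Independent checks, one per continuation byte. */
	if (!IS_CONTINUATION(b2) || !IS_CONTINUATION(b3))
		return false;

	/* Overlong: E0 must be followed by A0..BF. */
	if (b1 == 0xE0 && b2 < 0xA0)
		return false;

	/* Surrogates (U+D800..U+DFFF): ED must be followed by 80..9F. */
	if (b1 == 0xED && b2 > 0x9F)
		return false;

	return true;
}
```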

> Additionally, we can try to simplify
> the subsequent overlong or surrogate char checks. Something like this

My recent experience with itemptrs has made me skeptical of this kind of
thing, but the idea was interesting enough that I couldn't resist trying it
out. I have two attempts, which are attached as v16*.txt and apply
independently. They are rough, and some comments are now lies. To simplify
the constants, I do shift down to uint32, and I didn't bother working
around that. v16alpha regressed on worst-case input, so for v16beta I went
back to the earlier coding for the one-byte ascii check. That helped, but
it's still slower than v14.

That was not unexpected, but I was mildly shocked to find out that v15 is
also slower than the v14 that Heikki posted. The only non-cosmetic
difference is using pg_utf8_verifychar_internal within pg_utf8_verifychar.
I'm not sure why it would make such a big difference here. The numbers on
Power8 / gcc 4.8 (little endian):

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1474 |   1508

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     885 |   607 |   179 |     774 |   1325

v15:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   671 |   180 |    1032 |   1799

v16alpha:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1268 |   822 |   180 |    1410 |   2518

v16beta:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1096 |   654 |   182 |     814 |   1403

As it stands now, for v17 I'm inclined to go back to v15, but without the
attempt at being clever that seems to have slowed it down from v14.

Any interest in testing on 64-bit Arm?

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:
v16alpha-Rewrite-pg_utf8_verifystr-for-speed.txt (text/plain, 12.8 KB)
v16beta-Rewrite-pg_utf8_verifystr-for-speed.txt (text/plain, 12.7 KB)
