Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-30 16:54:23
Message-ID: CAFBsxsGZ_ssdVmOK5qbcO5on87ByyDvW3APRohR=kCfb8Z3XVA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 30, 2021 at 7:18 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> Hmm, there's one more simple trick we can do: We can have a separate
> fast-path version of the loop when there are at least 8 bytes of input
> left, skipping all the length checks. With that:

Good idea, and the numbers look good on Power8 / gcc 4.8 as well:

master:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1473 |   1508

v13:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     949 |   642 |   203 |    1046 |   1818

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     887 |   607 |   179 |     776 |   1325
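
A rough sketch of the fast-path idea quoted above, for illustration only; this
is not the v14 code. The names utf8_valid_prefix, verify_one, verify_one_nolen
and expected_len are hypothetical. The reasoning: a UTF-8 character is at most
4 bytes, so while at least 8 bytes remain, no character can run past the end
of the input and the per-character length check can be skipped. The 8-byte
all-ASCII shortcut is an additional assumption of this sketch, and unlike
PostgreSQL's verifier it does not reject NUL bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length implied by a lead byte, or 0 if the byte cannot start a character. */
static int
expected_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/*
 * Validate one character without a length check. The caller guarantees that
 * expected_len(s[0]) bytes are readable at s. Returns the length, or 0.
 */
static int
verify_one_nolen(const unsigned char *s)
{
    int n = expected_len(s[0]);

    for (int i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */

    /* Reject overlong forms, surrogates, and code points above U+10FFFF. */
    if ((s[0] == 0xE0 && s[1] < 0xA0) || (s[0] == 0xED && s[1] > 0x9F) ||
        (s[0] == 0xF0 && s[1] < 0x90) || (s[0] == 0xF4 && s[1] > 0x8F))
        return 0;
    return n;
}

/* Length-checked variant for the tail of the input; requires len >= 1. */
static int
verify_one(const unsigned char *s, size_t len)
{
    int n = expected_len(s[0]);

    if (n == 0 || (size_t) n > len)
        return 0;               /* bad lead byte or premature end */
    return verify_one_nolen(s);
}

/* Returns how many leading bytes of s[0..len) are valid UTF-8. */
static size_t
utf8_valid_prefix(const unsigned char *s, size_t len)
{
    const unsigned char *start = s;

    assert(s != NULL || len == 0);

    /* Fast path: at least 8 bytes left, so no length checks are needed. */
    while (len >= 8)
    {
        uint64_t chunk;
        int      l;

        memcpy(&chunk, s, sizeof(chunk));
        if ((chunk & UINT64_C(0x8080808080808080)) == 0)
        {
            s += 8;             /* eight ASCII bytes at once */
            len -= 8;
            continue;
        }
        l = verify_one_nolen(s);
        if (l == 0)
            return (size_t) (s - start);
        s += l;
        len -= l;
    }

    /* Slow path: fewer than 8 bytes left, check lengths per character. */
    while (len > 0)
    {
        int l = verify_one(s, len);

        if (l == 0)
            break;
        s += l;
        len -= l;
    }
    return (size_t) (s - start);
}
```

The return value equals len exactly when the whole input is valid, so a caller
can compare the two to detect the offset of the first bad byte.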

I don't think the new structuring will pose any challenges for rebasing
0002, either. This might need some experimentation, though:

+ * Subroutine of pg_utf8_verifystr() to check on char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It seems like it would be easy to put pg_utf8_verify_one in my proposed
pg_utf8.h header and replace the body of pg_utf8_verifychar with a call to it.
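
A minimal sketch of what that combination could look like, under stated
assumptions. verify_one_sketch stands in for the patch's pg_utf8_verify_one
(length of the character, or 0 on invalid input or premature end), and
verifychar_sketch stands in for pg_utf8_verifychar, which reports failure as
-1. Both names and the function bodies are hypothetical; the patch's actual
code may differ, including in whether the shared routine rejects NUL bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Would live in the shared header: length of the character at s, or 0. */
static inline int
verify_one_sketch(const unsigned char *s, int len)
{
    unsigned char b1;
    int           n;

    assert(s != NULL);
    if (len < 1)
        return 0;
    b1 = s[0];

    if (b1 < 0x80)
        return (b1 != 0) ? 1 : 0;   /* assumption: NUL is rejected here */
    else if (b1 >= 0xC2 && b1 <= 0xDF)
        n = 2;
    else if (b1 >= 0xE0 && b1 <= 0xEF)
        n = 3;
    else if (b1 >= 0xF0 && b1 <= 0xF4)
        n = 4;
    else
        return 0;                   /* stray continuation or invalid lead */

    if (len < n)
        return 0;                   /* premature end of input */

    for (int i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;

    /* Reject overlong forms, surrogates, and code points above U+10FFFF. */
    if ((b1 == 0xE0 && s[1] < 0xA0) || (b1 == 0xED && s[1] > 0x9F) ||
        (b1 == 0xF0 && s[1] < 0x90) || (b1 == 0xF4 && s[1] > 0x8F))
        return 0;

    return n;
}

/* The per-character entry point becomes a thin wrapper: 0 maps to -1. */
static int
verifychar_sketch(const unsigned char *s, int len)
{
    int l = verify_one_sketch(s, len);

    return (l > 0) ? l : -1;
}
```

The only difference between the two entry points is then the failure value,
which keeps the validation logic in one place.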

--
John Naylor
EDB: http://www.enterprisedb.com
