Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-15 22:00:05
Message-ID: CAFBsxsEzzTR=Zd=HnT2TZcQ8So1AzWbD1xXUvRsos8w-0C_nPg@mail.gmail.com
Lists: pgsql-hackers

I wrote:

> To simplify the constants, I do shift down to uint32, and I didn't bother
> working around that. v16alpha regressed on worst-case input, so for v16beta
> I went back to earlier coding for the one-byte ascii check. That helped,
> but it's still slower than v14.

It occurred to me that I could rewrite the switch test as simple
comparisons, like I already had for the 2- and 4-byte lead cases. While I
was at it, I folded the leading byte and continuation tests into a single
operation, like this:

/* 3-byte lead with two continuation bytes */
else if ((chunk & 0xF0C0C00000000000) == 0xE080800000000000)

...and also tried using 64-bit constants to avoid shifting. Still didn't
quite beat v14, but got pretty close:

> The numbers on Power8 / gcc 4.8 (little endian):
>
> HEAD:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>     2951 |  1521 |   871 |    1474 |   1508
>
> v14:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>      885 |   607 |   179 |     774 |   1325

v16gamma:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     952 |   632 |   180 |     800 |   1333

A big-endian 64-bit platform just might shave enough cycles to beat v14
this way... or not.
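
To make the folded test quoted above concrete, here is a minimal standalone
sketch. It is not the code from the attached patch: the helper names
load_chunk_be() and looks_like_3byte_seq() are made up for illustration, and
I am assuming the chunk is arranged so that the first input byte sits in the
most significant byte, which is what the mask constants imply. The mask keeps
the top 4 bits of the lead byte and the top 2 bits of each of the next two
bytes, so a single compare checks the pattern 1110xxxx 10xxxxxx 10xxxxxx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack 8 input bytes into a uint64 so that s[0] ends up
 * in the most significant byte, regardless of host byte order.
 */
static uint64_t
load_chunk_be(const unsigned char *s)
{
	uint64_t	chunk = 0;

	assert(s != NULL);
	for (int i = 0; i < 8; i++)
		chunk = (chunk << 8) | s[i];
	return chunk;
}

/*
 * Hypothetical helper: one mask-and-compare that tests the lead byte and
 * both continuation bytes of a 3-byte sequence at once.  This only checks
 * the bit patterns; it does not reject overlong encodings or surrogates.
 */
static bool
looks_like_3byte_seq(uint64_t chunk)
{
	return (chunk & UINT64_C(0xF0C0C00000000000)) ==
		UINT64_C(0xE080800000000000);
}
```

Note that E0 80 80 passes this test even though it is an overlong encoding,
so a real verifier still needs range checks on top of the pattern match. On
a big-endian machine a plain 8-byte load would presumably give this byte
arrangement directly, with no swap or byte-wise packing.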

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v16gamma-Rewrite-pg_utf8_verifystr-for-speed.txt text/plain 12.1 KB
