Re: speed up verifying UTF-8

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-15 05:09:48
Message-ID: CAJ3gD9ejC+puY=Lgco2SGyD4tR46kye7qLZoskW0PXumtLcCpQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 13 Jul 2021 at 01:15, John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my proposed pg_utf8.h header and replace the body of pg_utf8_verifychar with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match other functions in the file. We could also do this for check_ascii, but it's not quite the same thing, because the string could still have valid bytes in it, just not enough to advance the pointer by the stride length.
> - Remove hard-coded numbers (not wedded to this).
>
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to thin wrapper around pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function? We could rename it to check_ascii64() for clarity.
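
To make that first point concrete, here is a rough sketch of what a fixed-width check_ascii64() might look like. This is only an illustration, not code from the patch: the body, the return convention (8 on success, 0 otherwise), and the rejection of zero bytes are all assumptions.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical check_ascii64(): returns 8 if the eight bytes starting at s
 * are all non-zero ASCII, and 0 otherwise. There is no len argument; the
 * caller must guarantee that eight bytes are readable.
 */
static inline int
check_ascii64(const unsigned char *s)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t    chunk;

    /* memcpy avoids unaligned access; byte order does not matter here */
    memcpy(&chunk, s, sizeof(chunk));

    /* Any byte with its high bit set is not ASCII. */
    if (chunk & high_bits)
        return 0;

    /*
     * Every byte is now <= 0x7F, so adding 0x7F to each byte cannot carry
     * into its neighbour. The high bit of a byte ends up set exactly when
     * that byte was non-zero (assumption: zero bytes must be rejected).
     */
    if (((chunk + UINT64_C(0x7F7F7F7F7F7F7F7F)) & high_bits) != high_bits)
        return 0;

    return (int) sizeof(chunk);
}
```

A caller would advance by the return value while it is non-zero and fall back to the per-character path otherwise.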

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version)? In check_ascii(), we extract a
64-bit chunk from the string anyway. We can use the same chunk to
extract the required bits from a 2-byte char or a 4-byte char. This
way we can avoid extracting separate bytes like b1 = *s; b2 = s[1]
etc. More importantly, we can avoid the separate continuation-byte
checks for each individual byte. Additionally, we can try to simplify
the subsequent overlong or surrogate char checks. Something like this:

int pg_utf8_verifychar_32(uint32 chunk)
{
    int         len, l;

    /*
     * l counts bytes, so the chunk is shifted by l * 8 bits. The shift goes
     * through uint64 because shifting a uint32 by 32 bits is undefined.
     */
    for (len = sizeof(chunk); len > 0;
         (len -= l), (chunk = (uint32) ((uint64) chunk << (l * 8))))
    {
        /* Is 2-byte lead (110x.xxxx) */
        if ((chunk & 0xE0000000) == 0xC0000000)
        {
            l = 2;
            /* ....... ....... */
        }
        /* Is 3-byte lead (1110.xxxx) */
        else if ((chunk & 0xF0000000) == 0xE0000000)
        {
            l = 3;
            if (len < l)
                break;

            /* b2 and b3 should be continuation bytes */
            if ((chunk & 0x00C0C000) != 0x00808000)
                return sizeof(chunk) - len;

            switch (chunk & 0xFF200000)
            {
                /*
                 * Check 3-byte overlong: 1110.0000 100x.xxxx 10xx.xxxx,
                 * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2 is of
                 * the form 10xx.xxxx since it's a continuation byte. The
                 * additional condition b2 <= 0x9F means it is of the form
                 * 100x.xxxx, i.e. either 1000.xxxx or 1001.xxxx. So just
                 * verify that it is xx0x.xxxx.
                 */
                case 0xE0000000:
                    return sizeof(chunk) - len;

                /*
                 * Check surrogate: 1110.1101 101x.xxxx 10xx.xxxx,
                 * i.e. (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means
                 * 1010.xxxx, 1011.xxxx, or 11xx.xxxx. The last is not
                 * possible because b2 is a continuation byte, so it has to
                 * be one of the first two. So just verify that it is
                 * xx1x.xxxx.
                 */
                case 0xED200000:
                    return sizeof(chunk) - len;

                default:
                    ;
            }
        }
        /* Is 4-byte lead (1111.0xxx) */
        else if ((chunk & 0xF8000000) == 0xF0000000)
        {
            /* ......... */
            l = 4;
        }
        else
            return sizeof(chunk) - len;
    }
    return sizeof(chunk) - len;
}
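
The 3-byte branch above can be tried out in isolation with something like the following. It is only a sketch under a few assumptions: the chunk is packed big-endian (first string byte in the most significant position), which is what the masks in the function expect, and the helper names load_chunk_be32() and verify_3byte_lead() are made up for this illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack four bytes so that s[0] lands in the most
 * significant byte, independent of the host byte order.
 */
static inline uint32_t
load_chunk_be32(const unsigned char *s)
{
    return ((uint32_t) s[0] << 24) |
           ((uint32_t) s[1] << 16) |
           ((uint32_t) s[2] << 8) |
           (uint32_t) s[3];
}

/*
 * Hypothetical stand-alone version of the 3-byte branch: returns 3 if the
 * top three bytes of chunk form a valid 3-byte UTF-8 sequence, else -1.
 */
static inline int
verify_3byte_lead(uint32_t chunk)
{
    /* lead byte must be 1110.xxxx */
    if ((chunk & 0xF0000000) != 0xE0000000)
        return -1;

    /* b2 and b3 must both be 10xx.xxxx, tested with a single mask */
    if ((chunk & 0x00C0C000) != 0x00808000)
        return -1;

    /* keep b1 plus the one bit of b2 that separates the bad ranges */
    switch (chunk & 0xFF200000)
    {
        case 0xE0000000:        /* overlong: b1 == 0xE0 and b2 < 0xA0 */
            return -1;
        case 0xED200000:        /* surrogate: b1 == 0xED and b2 > 0x9F */
            return -1;
        default:
            break;
    }
    return 3;
}
```

For example, the bytes E2 82 AC (U+20AC) pass, while E0 80 80 (overlong) and ED A0 80 (surrogate U+D800) are rejected without examining b2 and b3 one at a time.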
