Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-12 19:45:39
Message-ID: CAFBsxsGB=dSBee2M+5-OntnkLgh_LajmW4P+dXhesnmbijfQLg@mail.gmail.com
Lists: pgsql-hackers

I wrote:

> I don't think the new structuring will pose any challenges for rebasing
> 0002, either. This might need some experimentation, though:
>
> + * Subroutine of pg_utf8_verifystr() to check on char. Returns the length of the
> + * character at *s in bytes, or 0 on invalid input or premature end of input.
> + *
> + * XXX: could this be combined with pg_utf8_verifychar above?
> + */
> +static inline int
> +pg_utf8_verify_one(const unsigned char *s, int len)
>
> It seems like it would be easy to have pg_utf8_verify_one in my proposed
> pg_utf8.h header and replace the body of pg_utf8_verifychar with it.

0001: I went ahead and tried this for v15, and also attempted some clean-up:

- Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
- Have pg_utf8_verifychar_internal return -1 for invalid input to match
other functions in the file. We could also do this for check_ascii, but
it's not quite the same thing, because the string could still have valid
bytes in it, just not enough to advance the pointer by the stride length.
- Remove hard-coded numbers (not wedded to this).

- Use a call to pg_utf8_verifychar in the slow path.
- Reduce pg_utf8_verifychar to a thin wrapper around
pg_utf8_verifychar_internal.

The last two aren't strictly necessary, but they prevent bloating the binary
in the slow path and aid readability. For 0002, this required putting
pg_utf8_verifychar* in src/port. (While writing this, I noticed I neglected
to explain that with a comment, though.)
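
For readers without the patch at hand, the structure described in the list
above might look roughly like the following. This is a minimal sketch, not
the code from the attached v15 patches: every sketch_* name and
SKETCH_STRIDE_LENGTH are hypothetical, and the 8-byte stride and the
rejection of NUL bytes are assumptions made for illustration. It shows the
ASCII check returning 0 when it cannot advance by a full stride (which does
not by itself mean the input is invalid), the inline per-character check
returning -1 for invalid input, and the slow path going through a
non-inline thin wrapper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stride: number of bytes the ASCII fast path checks at once. */
#define SKETCH_STRIDE_LENGTH ((int) sizeof(uint64_t))
#define SKETCH_IS_CONT(b) (((b) & 0xC0) == 0x80)

/*
 * Returns SKETCH_STRIDE_LENGTH if the next stride is all non-zero ASCII,
 * else 0. A result of 0 only means "cannot advance by a full stride"; the
 * bytes may still be valid, so the caller falls back to the slow path.
 */
static inline int
sketch_check_ascii(const unsigned char *s, int len)
{
	uint64_t	chunk;

	if (len < SKETCH_STRIDE_LENGTH)
		return 0;
	memcpy(&chunk, s, sizeof(chunk));
	/* Any high bit set means a non-ASCII byte is present. */
	if (chunk & UINT64_C(0x8080808080808080))
		return 0;
	/* With all bytes <= 0x7F, adding 0x7F sets a byte's high bit iff it is non-zero. */
	chunk += UINT64_C(0x7F7F7F7F7F7F7F7F);
	if ((chunk & UINT64_C(0x8080808080808080)) != UINT64_C(0x8080808080808080))
		return 0;
	return SKETCH_STRIDE_LENGTH;
}

/* Returns the byte length (1..4) of the character at *s, or -1 if invalid. */
static inline int
sketch_verifychar_internal(const unsigned char *s, int len)
{
	if (len < 1)
		return -1;
	if (s[0] < 0x80)
		return (s[0] != 0) ? 1 : -1;	/* assumption: NUL is rejected */
	if (s[0] >= 0xC2 && s[0] <= 0xDF)
		return (len >= 2 && SKETCH_IS_CONT(s[1])) ? 2 : -1;
	if ((s[0] & 0xF0) == 0xE0)
	{
		if (len < 3 || !SKETCH_IS_CONT(s[1]) || !SKETCH_IS_CONT(s[2]))
			return -1;
		if (s[0] == 0xE0 && s[1] < 0xA0)
			return -1;			/* overlong encoding */
		if (s[0] == 0xED && s[1] > 0x9F)
			return -1;			/* UTF-16 surrogate range */
		return 3;
	}
	if (s[0] >= 0xF0 && s[0] <= 0xF4)
	{
		if (len < 4 || !SKETCH_IS_CONT(s[1]) ||
			!SKETCH_IS_CONT(s[2]) || !SKETCH_IS_CONT(s[3]))
			return -1;
		if (s[0] == 0xF0 && s[1] < 0x90)
			return -1;			/* overlong encoding */
		if (s[0] == 0xF4 && s[1] > 0x8F)
			return -1;			/* above U+10FFFF */
		return 4;
	}
	return -1;					/* stray continuation byte or invalid lead byte */
}

/* Thin non-inline wrapper, so the slow path calls it instead of inlining the body. */
int
sketch_verifychar(const unsigned char *s, int len)
{
	return sketch_verifychar_internal(s, len);
}

/* Returns the number of leading bytes of s that form valid input. */
int
sketch_verifystr(const unsigned char *s, int len)
{
	const unsigned char *start = s;

	assert(len >= 0);
	while (len > 0)
	{
		int			l = sketch_check_ascii(s, len);	/* fast path */

		if (l == 0)
		{
			l = sketch_verifychar(s, len);	/* slow path */
			if (l == -1)
				break;
		}
		s += l;
		len -= l;
	}
	return (int) (s - start);
}
```

In this sketch a short all-ASCII string such as "abc" makes
sketch_check_ascii return 0 even though every byte is valid, which is why
that return value is not treated the same way as the -1 from the
per-character check.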

Feedback welcome on any of the above.

Since by now it hardly resembles the simdjson (or Fuchsia for that matter)
fallback that it took inspiration from, I've removed that mention from the
commit message.

0002: Just a rebase to work with the above. One possible review point: We
don't really need to have separate control over whether to use special
instructions for CRC and UTF-8. It should probably be just one configure
knob, but having them separate is perhaps easier to review.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:
v15-0001-Rewrite-pg_utf8_verifystr-for-speed.patch (application/octet-stream, 13.5 KB)
v15-0002-Use-SSE-instructions-for-pg_utf8_verifystr-where.patch (application/octet-stream, 50.6 KB)
