Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-10-19 21:42:40
Message-ID: CAFBsxsHUgNeytyF6TyoUBgf8whqRxvStbWtok9qcDJzDZ78FLw@mail.gmail.com
Lists: pgsql-hackers

I've decided I'm not quite comfortable with the additional complexity in
the build system introduced by the SIMD portion of the previous patches.
That complexity would be easier to justify if the pure C portion were
unchanged, but with the shift-based DFA plus the bitwise ASCII check, we
have a portable implementation that's still a substantial improvement over
the current validator. In v24, I've included only that much, and the diff
is only about 1/3 as many lines as before. If future improvements to COPY
FROM put additional pressure on this path, we can always add SIMD support
later.
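
To illustrate the kind of bitwise ASCII check meant here, below is a minimal
sketch that tests eight bytes at a time. It is not the code from the patch:
the function name chunk_is_nonzero_ascii() and the fixed 8-byte chunk size
are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): returns true if the 8 bytes
 * starting at s are all in the range 0x01..0x7F, i.e. ASCII with no
 * embedded zero byte.
 */
static bool
chunk_is_nonzero_ascii(const unsigned char *s)
{
	uint64_t	chunk;

	assert(s != NULL);

	/* memcpy sidesteps alignment requirements for the 64-bit load. */
	memcpy(&chunk, s, sizeof(chunk));

	/* A set high bit in any byte means that byte is not ASCII. */
	if (chunk & UINT64_C(0x8080808080808080))
		return false;

	/*
	 * Every byte is now known to be <= 0x7F, so adding 0x7F to each byte
	 * cannot carry into its neighbor. After the addition, a byte has its
	 * high bit set exactly when it was nonzero.
	 */
	chunk += UINT64_C(0x7f7f7f7f7f7f7f7f);

	return (chunk & UINT64_C(0x8080808080808080)) ==
		UINT64_C(0x8080808080808080);
}
```

A caller would presumably apply such a check to each chunk first and fall
back to the DFA only for chunks where it returns false, so pure-ASCII input
never touches the state machine.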

One thing not in this patch is a possible improvement to
pg_utf8_verifychar() that Heikki and I worked on upthread as part of
earlier attempts to rewrite pg_utf8_verifystr(). That's worth looking into
separately.

On Thu, Aug 26, 2021 at 12:09 PM Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com> wrote:
>
> > Attached is v23 incorporating the 32-bit transition table, with the
> > necessary comment adjustments
>
> 32bit table is nice.

Thanks for taking a look!

> Would you please replace
> https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL
> with
> https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp
> in the header of src/port/pg_utf8_fallback.c?
>
> It would make the URL more stable in case the file gets renamed.
>
> Vladimir
>

Makes sense, so done that way.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v24-0001-Add-fast-path-for-validating-UTF-8-text.patch application/octet-stream 23.8 KB
