Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-28 18:12:11
Message-ID: CAFBsxsH=jfWgo7-ToygfdjnC60C3V_N=6=EoCfQ50U3cED_W8g@mail.gmail.com
Lists: pgsql-hackers

I wrote:

> On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com> wrote:
> >
> > >+ utf8_advance(s, state, len);
> > >+
> > >+ /*
> > >+ * If we saw an error during the loop, let the caller handle it. We treat
> > >+ * all other states as success.
> > >+ */
> > >+ if (state == ERR)
> > >+ return 0;
> >
> > Did you mean state = utf8_advance(s, state, len); there? (reassign state variable)
>
> Yep, that's a bug, thanks for catching!

Fixed in v21, with a regression test added. Also, utf8_advance() now
updates the state directly through a passed pointer rather than returning a value.
Some cosmetic changes:

s/valid_bytes/non_error_bytes/ since the former is kind of misleading now.

Renamed some other variables and symbols. In my first DFA experiment, the
symbol ASC conflicted with the parser or scanner somehow, but it doesn't
here, so I'm using it since it's clearer.

Rewrote a lot of comments about the state machine and regression tests.
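
To illustrate the bug Vladimir caught and why passing a pointer avoids it,
here is a minimal sketch. It is not the code from the patch: the toy_* names
and the state numbering are made up for illustration, and the toy state
machine only counts continuation bytes, so it does not reject overlong
encodings or surrogates the way a real validator must.

```c
#include <stddef.h>

/*
 * Toy states (hypothetical, not the patch's DFA): the value is the number
 * of continuation bytes still expected, plus a sticky error state.
 */
enum { BGN = 0, CS1 = 1, CS2 = 2, CS3 = 3, ERR = 4 };

/* Consume one byte and return the next state. */
static int
toy_step(int state, unsigned char c)
{
	if (state == ERR)
		return ERR;				/* errors are sticky */

	if (state == BGN)
	{
		if (c < 0x80)
			return BGN;			/* ASCII */
		if ((c & 0xE0) == 0xC0)
			return CS1;			/* 2-byte lead */
		if ((c & 0xF0) == 0xE0)
			return CS2;			/* 3-byte lead */
		if ((c & 0xF8) == 0xF0)
			return CS3;			/* 4-byte lead */
		return ERR;
	}

	/* In CS1..CS3 we need a continuation byte (10xxxxxx). */
	if ((c & 0xC0) == 0x80)
		return state - 1;		/* CS1 -> BGN, CS2 -> CS1, ... */
	return ERR;
}

/* Old shape: returns the new state, so the caller must reassign it. */
static int
toy_advance_by_value(const unsigned char *s, int state, size_t len)
{
	while (len-- > 0)
		state = toy_step(state, *s++);
	return state;
}

/* New shape: updates the caller's state through a pointer. */
static void
toy_advance(const unsigned char *s, int *state, size_t len)
{
	int			st = *state;

	while (len-- > 0)
		st = toy_step(st, *s++);
	*state = st;
}

/* Buggy caller: the result is dropped, so state never leaves BGN. */
int
toy_check_buggy(const unsigned char *s, size_t len)
{
	int			state = BGN;

	(void) toy_advance_by_value(s, state, len);
	return state != ERR;		/* always 1, even for invalid input */
}

/* Fixed caller: there is no return value to forget. */
int
toy_check_fixed(const unsigned char *s, size_t len)
{
	int			state = BGN;

	toy_advance(s, &state, len);
	return state != ERR;
}
```

For the three bytes 'a', 0xFF, 'b', toy_check_buggy() returns 1 and
toy_check_fixed() returns 0. As in the quoted hunk, only ERR is treated as
failure here; an incomplete sequence at the end of the chunk is left for
the caller to deal with.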
--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v21-0001-Add-fast-paths-for-validating-UTF-8-text.patch application/octet-stream 63.3 KB
