Re: speed up verifying UTF-8

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	Re: speed up verifying UTF-8
Date:	2021-12-08 18:11:46
Message-ID:	CAFBsxsEnrzO4=-Cue=8n6P+Jr348FG-kEXLeMGdXycUOt1obAg@mail.gmail.com
Lists:	pgsql-hackers

It occurred to me that the DFA + ascii quick check approach could also
be adapted to speed up some cases where we currently walk a string
counting characters, like this snippet in
text_position_get_match_pos():

	/* Convert the byte position to char position. */
	while (state->refpoint < state->last_match)
	{
		state->refpoint += pg_mblen(state->refpoint);
		state->refpos++;
	}

This coding changed in 9556aa01c69 (Use single-byte
Boyer-Moore-Horspool search even with multibyte encodings), in which I
found the majority of cases were faster, but some were slower. It
would be nice to regain the speed lost and do even better.

In the case of UTF-8, we could just run the string through the DFA,
incrementing a counter each time the DFA reaches the END state. The
number of END states reached should equal the number of characters.
The ascii quick check would still be applicable as well. I think all
that is needed is to export some symbols and add the counting
function. That would be a separate change and wouldn't materially
affect the current patch for input verification, but it would be nice
to get the symbol visibility right up front. I've set this to waiting
on author while I experiment with that.
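
As a rough illustration of the counting idea (not code from the patch):
the sketch below uses a simplified hand-written state machine rather
than the DFA from the verification patch, and it assumes the input has
already been verified as valid UTF-8. The names utf8_count_chars,
chunk_is_ascii, and the ST_* states are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical state numbering: the state is the number of continuation
 * bytes still expected, so ST_END means "at a character boundary".
 */
enum
{
	ST_END = 0,
	ST_NEED1 = 1,
	ST_NEED2 = 2,
	ST_NEED3 = 3
};

/* Return nonzero if none of the 8 bytes at s has the high bit set. */
static int
chunk_is_ascii(const unsigned char *s)
{
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));
	return (chunk & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * Count the characters in s[0..len). Assumes valid UTF-8; invalid input
 * is not detected and may give a meaningless count.
 */
static size_t
utf8_count_chars(const unsigned char *s, size_t len)
{
	size_t		nchars = 0;
	int			state = ST_END;

	while (len > 0)
	{
		/* ASCII quick check, only usable at a character boundary. */
		if (state == ST_END && len >= 8 && chunk_is_ascii(s))
		{
			nchars += 8;
			s += 8;
			len -= 8;
			continue;
		}

		if (state != ST_END)
			state--;			/* consumed one continuation byte */
		else if (*s >= 0xF0)
			state = ST_NEED3;	/* lead byte of a 4-byte sequence */
		else if (*s >= 0xE0)
			state = ST_NEED2;	/* lead byte of a 3-byte sequence */
		else if (*s >= 0x80)
			state = ST_NEED1;	/* lead byte of a 2-byte sequence */
		/* else: ASCII byte, state stays ST_END */

		/* Each arrival at END completes exactly one character. */
		if (state == ST_END)
			nchars++;

		s++;
		len--;
	}

	return nchars;
}
```

With something like this, the loop in text_position_get_match_pos()
could presumably add the returned count to state->refpos in one call
instead of stepping with pg_mblen(), though I haven't tested that.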

--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Re: speed up verifying UTF-8 at 2021-10-19 21:42:40 from John Naylor
