From: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
---|---|
To: | Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com> |
Cc: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: speed up verifying UTF-8 |
Date: | 2021-07-19 15:07:15 |
Message-ID: | CAFBsxsGL6sKNdJZQ78Cq3isAmun2EeWv1_O=PHp3f1woCWebjA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com> wrote:
> It looks like it is important to have shrx for x86, which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
> Just in case: I know x86 would not use the fallback implementation; however, the sole purpose of the shift-based DFA is to fold all the data-dependent ops into a single instruction.
I saw mention of that instruction, but didn't understand how important it
was, thanks.
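For readers unfamiliar with the technique, here is a minimal sketch of a shift-based DFA for UTF-8, assuming a nine-state layout with 6 bits per state. The state names and the functions `next_state`, `build_table`, and `utf8_valid_shift_dfa` are hypothetical and not taken from the patch. Each input byte selects a 64-bit table row, and the current state is a bit offset into that row, so a whole transition is one load plus one variable shift (the kind of shift that shrx provides on x86):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state names; ERR must be 0 so that the error state is sticky. */
enum { ERR, BGN, CS1, CS2, CS3, P3A, P3B, P4A, P4B, NSTATES };
#define BITS 6                  /* bits reserved per state inside a row */

/* Conventional transition function, used only to build the table. */
static int next_state(int s, unsigned b)
{
    int cont = (b >= 0x80 && b <= 0xBF);    /* generic continuation byte */

    switch (s) {
    case BGN:
        if (b <= 0x7F) return BGN;
        if (b >= 0xC2 && b <= 0xDF) return CS1;
        if (b == 0xE0) return P3A;          /* must exclude overlong forms */
        if (b == 0xED) return P3B;          /* must exclude surrogates */
        if (b >= 0xE1 && b <= 0xEF) return CS2;
        if (b == 0xF0) return P4A;          /* must exclude overlong forms */
        if (b >= 0xF1 && b <= 0xF3) return CS3;
        if (b == 0xF4) return P4B;          /* must stay <= U+10FFFF */
        return ERR;
    case CS1: return cont ? BGN : ERR;
    case CS2: return cont ? CS1 : ERR;
    case CS3: return cont ? CS2 : ERR;
    case P3A: return (b >= 0xA0 && b <= 0xBF) ? CS1 : ERR;
    case P3B: return (b >= 0x80 && b <= 0x9F) ? CS1 : ERR;
    case P4A: return (b >= 0x90 && b <= 0xBF) ? CS2 : ERR;
    case P4B: return (b >= 0x80 && b <= 0x8F) ? CS2 : ERR;
    default:  return ERR;
    }
}

static uint64_t dfa_table[256];

/* Pack, for every byte, the next state of each current state into one row. */
static void build_table(void)
{
    assert(NSTATES * BITS <= 64);
    for (unsigned b = 0; b < 256; b++) {
        uint64_t row = 0;

        for (int s = 0; s < NSTATES; s++)
            row |= (uint64_t) (next_state(s, b) * BITS) << (s * BITS);
        dfa_table[b] = row;
    }
}

/* Returns 1 if s[0..len) is well-formed UTF-8, else 0. */
static int utf8_valid_shift_dfa(const unsigned char *s, size_t len)
{
    uint64_t state = BGN * BITS;    /* a state is stored as its bit offset */

    for (size_t i = 0; i < len; i++)
        /* Only the low 6 bits are meaningful; the mask keeps the shift defined in C. */
        state = dfa_table[s[i]] >> (state & 63);

    return (state & 63) == BGN * BITS;
}
```

The loop body has no branches and its only data-dependent step is the shift, which is why the quality of the variable-shift instruction the compiler emits matters so much here. `build_table()` must be called once before validating.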
> An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
> In other words, what if the implementation processes all characters always and uses a slower method in case of validation failure?
> I would guess it is more important to be faster with accepting valid input rather than "faster to reject invalid input".
> static int pg_utf8_verifystr2(const unsigned char *s, int len) {
>     if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid, then just accept it
>         return s + len;
>     }
>     // slow path: the string is not valid, perform a slower analysis
>     return s + ....;
> }
That might be workable. We have to be careful because in COPY FROM,
validation is performed on 64kB chunks, and the boundary could fall in the
middle of a multibyte sequence. In the SSE version, there is this comment:
+ /*
+ * NB: This check must be strictly greater-than, otherwise an invalid byte
+ * at the end might not get detected.
+ */
+ while (len > sizeof(__m128i))
...which should have more detail on this.
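To make the chunk-boundary concern concrete, here is a rough sketch of the fast-path/slow-path split, under stated assumptions. The names `fast_is_valid`, `slow_valid_prefix`, and `verify_chunk` are hypothetical, and the fast path is only a stand-in that accepts pure-ASCII chunks (a real one would be a SIMD or DFA validator). The point is that the slow path reports the length of the valid prefix, so a multibyte sequence cut off at the end of a chunk shows up as a short count rather than being silently accepted:

```c
#include <stddef.h>

/*
 * Length of the well-formed UTF-8 sequence at s, or 0 if the sequence is
 * invalid or does not fit in the avail bytes that remain.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    size_t n;

    if (c < 0x80) return 1;
    else if (c >= 0xC2 && c <= 0xDF) n = 2;
    else if (c >= 0xE0 && c <= 0xEF) n = 3;
    else if (c >= 0xF0 && c <= 0xF4) n = 4;
    else return 0;                          /* stray continuation or bad lead */

    if (avail < n) return 0;                /* truncated at the chunk end */
    for (size_t i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;

    /* Reject overlong encodings, surrogates, and code points > U+10FFFF. */
    if (c == 0xE0 && s[1] < 0xA0) return 0;
    if (c == 0xED && s[1] > 0x9F) return 0;
    if (c == 0xF0 && s[1] < 0x90) return 0;
    if (c == 0xF4 && s[1] > 0x8F) return 0;
    return n;
}

/* Slow path: walk sequence by sequence and return the valid prefix length. */
static size_t slow_valid_prefix(const unsigned char *s, size_t len)
{
    size_t pos = 0;

    while (pos < len) {
        size_t n = utf8_seq_len(s + pos, len - pos);

        if (n == 0) break;
        pos += n;
    }
    return pos;
}

/* Stand-in fast path (assumption): accepts only chunks that are pure ASCII. */
static int fast_is_valid(const unsigned char *s, size_t len)
{
    unsigned char acc = 0;

    for (size_t i = 0; i < len; i++)
        acc |= s[i];
    return (acc & 0x80) == 0;
}

/* Returns how many leading bytes of the chunk are valid, complete UTF-8. */
static size_t verify_chunk(const unsigned char *s, size_t len)
{
    if (fast_is_valid(s, len))
        return len;                         /* fast path: accept everything */
    return slow_valid_prefix(s, len);       /* slow path: locate the problem */
}
```

With this shape, a caller feeding fixed-size chunks would compare the result against `len` and carry the unverified tail over to the next chunk before deciding whether it is a real error or just a split character. Whether the fast path can stay this simple depends on it never accepting a chunk that ends in an incomplete sequence.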
--
John Naylor
EDB: http://www.enterprisedb.com