Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-03 15:33:21
Message-ID: CAFBsxsH1kPY+us49gXVf6zv6bLPC5NHm1f1VW+UdpVH8i+JD4w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 3, 2021 at 10:42 AM Greg Stark <stark(at)mit(dot)edu> wrote:
>
> I haven't looked at the surrounding code. Are we processing all the
> COPY data in one long stream or processing each field individually?

The validation is done in 64kB chunks.

> If
> we're processing much more than 128 bits and happy to detect NUL
> errors only at the end after wasting some work then you could hoist
> that has_zero check entirely out of the loop (removing the branch
> though it's probably a correctly predicted branch anyways).
>
> Do something like:
>
> zero_accumulator = zero_accumulator & next_chunk
>
> in the loop and then only at the very end check for zeros in that.

That's the approach taken in the SSE4 patch, and in fact that's the logical
way to do it there. I hadn't considered doing it that way in the pure C
case, but I think it's worth trying.
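
To sketch the idea in plain C (illustrative only, not the code from
either patch, and the helper name is made up), the loop would
accumulate both conditions and branch only once at the end:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns true if all bytes are ASCII and none are zero.  The caller is
 * assumed to pass a length that is a multiple of 8.
 */
static bool
ascii_chunk_check(const unsigned char *s, size_t len)
{
    const unsigned char *end = s + len;
    uint64_t    chunk;
    uint64_t    highbit_cum = 0;
    uint64_t    zero_cum = UINT64_C(0x8080808080808080);

    while (s < end)
    {
        memcpy(&chunk, s, sizeof(chunk));

        /* remember any high bit seen anywhere in the input */
        highbit_cum |= chunk;

        /*
         * Adding 0x7f sets each byte's high bit unless that byte was zero
         * (ignoring carries from bytes that already had the high bit set,
         * which the check above catches anyway), so a zero byte clears a
         * bit in the accumulator.
         */
        zero_cum &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);

        s += sizeof(chunk);
    }

    /* any non-ASCII byte? */
    if (highbit_cum & UINT64_C(0x8080808080808080))
        return false;

    /* any embedded zero byte? */
    if (zero_cum != UINT64_C(0x8080808080808080))
        return false;

    return true;
}

On error you don't know which chunk failed, so the caller would have to
fall back to a byte-at-a-time walk to report the exact position, which is
the "wasting some work" trade-off mentioned above.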

--
John Naylor
EDB: http://www.enterprisedb.com
