Re: speed up verifying UTF-8

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-03 13:15:59
Message-ID: 2356cfd5-f8fd-0d5c-3a1c-28e170d89b6e@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/06/2021 19:26, John Naylor wrote:
> For v10, I've split the patch up into two parts. 0001 uses pure C
> everywhere. This is much smaller and easier to review, and gets us the
> most bang for the buck.
>
> One concern Heikki raised upthread is that platforms with poor
> unaligned-memory access will see a regression. We could easily add an
> #ifdef to take care of that, but I haven't done so here.
>
> To recap: On ascii-only input with storage taken out of the picture,
> profiles of COPY FROM show a reduction from nealy 10% down to just over
> 1%. In microbenchmarks found earlier in this thread, this works out to
> about 7 times faster. On multibyte/mixed input, 0001 is a bit faster,
> but not really enough to make a difference in copy performance.

Nice!

This kind of bit-twiddling is fun, so I couldn't resist tinkering with
it, to see if we can shave some more instructions from it:

> +/* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
> +#define HAS_ZERO(chunk) ( \
> + ((chunk) - UINT64CONST(0x0101010101010101)) & \
> + ~(chunk) & \
> + UINT64CONST(0x8080808080808080))
> +
> +/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
> +static inline int
> +check_ascii(const unsigned char *s, int len)
> +{
> + uint64 half1,
> + half2,
> + highbits_set;
> +
> + if (len >= 2 * sizeof(uint64))
> + {
> + memcpy(&half1, s, sizeof(uint64));
> + memcpy(&half2, s + sizeof(uint64), sizeof(uint64));
> +
> + /* If there are zero bytes, bail and let the slow path handle it. */
> + if (HAS_ZERO(half1) || HAS_ZERO(half2))
> + return 0;
> +
> + /* Check if any bytes in this chunk have the high bit set. */
> + highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
> +
> + if (!highbits_set)
> + return 2 * sizeof(uint64);
> + else
> + return 0;
> + }
> + else
> + return 0;
> +}

Some ideas:

1. Better to check if any high bits are set first. We care more about
the speed of that than of detecting zero bytes, because input with high
bits is valid but zeros are an error.

2. Since we check that there are no high bits, we can do the zero-checks
with fewer instructions like this:

/* NB: this is only correct if 'chunk' doesn't have any high bits set */
#define HAS_ZERO(chunk) ( \
((chunk) + \
UINT64CONST(0x7f7f7f7f7f7f7f7f)) & \
UINT64CONST(0x8080808080808080) == UINT64CONST(0x8080808080808080))

3. It's probably cheaper perform the HAS_ZERO check just once on (half1
| half2). We have to compute (half1 | half2) anyway.

Putting all that together:

/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
static inline int
check_ascii(const unsigned char *s, int len)
{
uint64 half1,
half2,
highbits_set;
uint64 x;

if (len >= 2 * sizeof(uint64))
{
memcpy(&half1, s, sizeof(uint64));
memcpy(&half2, s + sizeof(uint64), sizeof(uint64));

/* Check if any bytes in this chunk have the high bit set. */
highbits_set = ((half1 | half2) & UINT64CONST(0x8080808080808080));
if (highbits_set)
return 0;

/*
* Check if there are any zero bytes in this chunk. This is only correct
* if there are no high bits set, but we checked that already.
*/
x = (half1 | half2) + UINT64CONST(0x7f7f7f7f7f7f7f7f);
x &= UINT64CONST(0x8080808080808080);
if (x != UINT64CONST(0x8080808080808080))
return 0;

return 2 * sizeof(uint64);
}
else
return 0;
}

In quick testing, that indeed compiles into fewer instructions. With
GCC, there's no measurable difference in performance. But with clang,
this version is much faster than the original, because the original
version is much slower than when compiled with GCC. In other words, this
version seems to avoid some clang misoptimization. I tested only with
ASCII input, I haven't tried other cases.

What test set have you been using for performance testing this? I'd like
to know how this version compares, and I could also try running it on my
old raspberry pi, which is more strict about alignmnt.

> 0002 adds the SSE4 implementation on x86-64, and is equally fast on all
> input, at the cost of greater complexity.

Didn't look closely, but seems reasonable at a quick glance.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2021-06-03 13:41:24 missing GRANT on pg_subscription columns
Previous Message Amit Langote 2021-06-03 13:08:45 Re: parent foreign tables and row marks