Re: speed up verifying UTF-8

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-12-10 18:33:48
Message-ID: b4648cc2-5e9c-c93a-52cc-51e5c658a4f6@iki.fi
Lists: pgsql-hackers

On 20/10/2021 00:42, John Naylor wrote:
> I've decided I'm not quite comfortable with the additional complexity in
> the build system introduced by the SIMD portion of the previous patches.
> It would make more sense if the pure C portion were unchanged, but with
> the shift-based DFA plus the bitwise ASCII check, we have a portable
> implementation that's still a substantial improvement over the current
> validator. In v24, I've included only that much, and the diff is only
> about 1/3 as many lines. If future improvements to COPY FROM put
> additional pressure on this path, we can always add SIMD support later.

+1.

I had another look at this now. Looks good, just a few minor comments below:

> +/*
> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
> + */
> +static inline bool
> +is_valid_ascii(const unsigned char *s, int len)
> +{
> + uint64 chunk,
> + highbit_cum = UINT64CONST(0),
> + zero_cum = UINT64CONST(0x8080808080808080);
> +
> + Assert(len % sizeof(chunk) == 0);
> +
> + while (len >= sizeof(chunk))
> + {
> + memcpy(&chunk, s, sizeof(chunk));
> +
> + /*
> + * Capture any zero bytes in this chunk.
> + *
> + * First, add 0x7f to each byte. This sets the high bit in each byte,
> + * unless it was a zero. We will check later that none of the bytes in
> + * the chunk had the high bit set, in which case the max value each
> + * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we
> + * don't need to worry about carrying over to the next byte.
> + *
> + * If any resulting high bits are zero, the corresponding high bits in
> + * the zero accumulator will be cleared.
> + */
> + zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
> +
> + /* Capture any set bits in this chunk. */
> + highbit_cum |= chunk;
> +
> + s += sizeof(chunk);
> + len -= sizeof(chunk);
> + }

This function assumes that the input len is a multiple of 8. There's an
assertion for that, but it would be good to also mention it in the
function comment. It took me a moment to realize that.

Given that assumption, I wonder if "while (len > 0)" would be marginally
faster. Or compute "s_end = s + len" first, and check for "while (s <
s_end)", so that you don't need to update 'len' in the loop.

It would also be good to mention what exactly the return value means,
i.e. "returns false if the input contains any bytes with the high bit
set, or any zero bytes".
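To be concrete, a sketch of that pointer-based variant (under a
hypothetical name, is_valid_ascii_sketch, so as not to clash with the
patch's is_valid_ascii; 'len' must still be a multiple of 8):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns false if the input contains any bytes with the high bit set,
 * or any zero bytes.  'len' must be a multiple of 8.
 */
static inline bool
is_valid_ascii_sketch(const unsigned char *s, int len)
{
	const unsigned char *s_end = s + len;
	uint64_t	highbit_cum = 0;
	uint64_t	zero_cum = UINT64_C(0x8080808080808080);

	assert(len % sizeof(uint64_t) == 0);

	while (s < s_end)
	{
		uint64_t	chunk;

		memcpy(&chunk, s, sizeof(chunk));

		/*
		 * Adding 0x7f sets the high bit of every byte except zeros, so
		 * any zero byte clears the corresponding bit in the accumulator.
		 */
		zero_cum &= (chunk + UINT64_C(0x7f7f7f7f7f7f7f7f));

		/* Capture any set high bits in this chunk. */
		highbit_cum |= chunk;

		s += sizeof(chunk);
	}

	if (zero_cum != UINT64_C(0x8080808080808080))
		return false;
	if (highbit_cum & UINT64_C(0x8080808080808080))
		return false;
	return true;
}
```

This is just the same computation with the end pointer hoisted out, so
the loop only updates one variable per iteration.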

> + /*
> + * Check if any high bits in the zero accumulator got cleared.
> + *
> + * XXX: As noted above, the zero check is only valid if the chunk had no
> + * high bits set. However, the compiler may perform these two checks in
> + * any order. That's okay because if any high bits were set, we would
> + * return false regardless, so invalid results from the zero check don't
> + * matter.
> + */
> + if (zero_cum != UINT64CONST(0x8080808080808080))
> + return false;

I don't understand the "the compiler may perform these checks in any
order" comment. We trust the compiler to do the right thing, and to only
reorder things when it's safe to do so. What is special about this case
that makes it worth mentioning?

> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
> return s - start;
> }
>
> -static int
> +static pg_noinline int
> pg_utf8_verifychar(const unsigned char *s, int len)
> {
> int l;

Why force it to not be inlined?

> + * In a shift-based DFA, the input byte is an index into array of integers
> + * whose bit pattern encodes the state transitions. To compute the current
> + * state, we simply right-shift the integer by the current state and apply a
> + * mask. In this scheme, the address of the transition only depends on the
> + * input byte, so there is better pipelining.

Should be "To compute the *next* state, ...", I think.

The way the state transition table works is pretty inscrutable. That's
understandable, because the values were found by an SMT solver, so I'm
not sure if anything can be done about it.
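For anyone reading along, the general scheme can be illustrated with a
toy shift-based DFA that has nothing to do with the patch's actual
transition values (all names below are made up): it accepts strings with
an even number of '1' bytes.

```c
#include <stdint.h>

/*
 * Each state is a bit offset into a 64-bit word; here 6 bits per state,
 * so up to ten states would fit.
 */
#define EVEN	0				/* accepting state, at bit offset 0 */
#define ODD		6				/* at bit offset 6 */

/*
 * A transition row packs the next state for every current state into
 * one 64-bit word, indexed only by the input byte.
 */
static uint64_t
transition_row(unsigned char c)
{
	if (c == '1')
		/* EVEN -> ODD, ODD -> EVEN */
		return ((uint64_t) ODD << EVEN) | ((uint64_t) EVEN << ODD);
	/* any other byte: stay in the current state */
	return ((uint64_t) EVEN << EVEN) | ((uint64_t) ODD << ODD);
}

static int
even_ones(const char *s)
{
	uint64_t	state = EVEN;

	for (; *s; s++)
		/*
		 * The table load depends only on the input byte, not on 'state',
		 * so it can issue before the previous state is known; only the
		 * cheap shift-and-mask is serialized between iterations.
		 */
		state = (transition_row((unsigned char) *s) >> state) & 63;

	return state == EVEN;
}
```

The patch's table works the same way, just with the UTF-8 states and
the solver-found row values, which is why the values themselves resist
eyeballing.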

- Heikki
