RE: [EXTERNAL] Re: speed up verifying UTF-8

From: "Godfrin, Philippe E" <Philippe(dot)Godfrin(at)nov(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: RE: [EXTERNAL] Re: speed up verifying UTF-8
Date: 2021-12-10 18:49:05
Message-ID: SA0PR15MB3933914472F7A062987AFCE382719@SA0PR15MB3933.namprd15.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>-----Original Message-----
>From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>Sent: Friday, December 10, 2021 12:34 PM
>To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>; Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
>Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>; Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>; Thomas Munro <thomas(dot)munro(at)gmail(dot)com>; Greg Stark <stark(at)mit(dot)edu>
>Subject: [EXTERNAL] Re: speed up verifying UTF-8
>
>On 20/10/2021 00:42, John Naylor wrote:
>> I've decided I'm not quite comfortable with the additional complexity
>> in the build system introduced by the SIMD portion of the previous patches.
>> It would make more sense if the pure C portion were unchanged, but
>> with the shift-based DFA plus the bitwise ASCII check, we have a
>> portable implementation that's still a substantial improvement over
>> the current validator. In v24, I've included only that much, and the
>> diff is only about 1/3 as many lines. If future improvements to COPY
>> FROM put additional pressure on this path, we can always add SIMD support later.
>
>+1.
>
>I had another look at this now. Looks good, just a few minor comments below:
>
>> +/*
>> + * Verify a chunk of bytes for valid ASCII, including a zero-byte check.
>> + */
>> +static inline bool
>> +is_valid_ascii(const unsigned char *s, int len) {
>> + uint64 chunk,
>> + highbit_cum = UINT64CONST(0),
>> + zero_cum = UINT64CONST(0x8080808080808080);
>> +
>> + Assert(len % sizeof(chunk) == 0);
>> +
>> + while (len >= sizeof(chunk))
>> + {
>> + memcpy(&chunk, s, sizeof(chunk));
>> +
>> + /*
>> + * Capture any zero bytes in this chunk.
>> + *
>> + * First, add 0x7f to each byte. This sets the high bit in each byte,
>> + * unless it was a zero. We will check later that none of the bytes in
>> + * the chunk had the high bit set, in which case the max value each
>> + * byte can have after the addition is 0x7f + 0x7f = 0xfe, and we
>> + * don't need to worry about carrying over to the next byte.
>> + *
>> + * If any resulting high bits are zero, the corresponding high bits in
>> + * the zero accumulator will be cleared.
>> + */
>> + zero_cum &= (chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f));
>> +
>> + /* Capture any set bits in this chunk. */
>> + highbit_cum |= chunk;
>> +
>> + s += sizeof(chunk);
>> + len -= sizeof(chunk);
>> + }
>
>This function assumes that the input len is a multiple of 8. There's an assertion for that, but it would be good to also mention it in the function comment. I took me a moment to realize that.
>
>Given that assumption, I wonder if "while (len >= 0)" would marginally faster. Or compute "s_end = s + len" first, and check for "while (s < s_end)", so that you don't need to update 'len' in the loop.
>
>Also would be good to mention what exactly the return value means. I.e "returns false if the input contains any bytes with the high-bit set, or zeros".
>
>> + /*
>> + * Check if any high bits in the zero accumulator got cleared.
>> + *
>> + * XXX: As noted above, the zero check is only valid if the chunk had no
>> + * high bits set. However, the compiler may perform these two checks in
>> + * any order. That's okay because if any high bits were set, we would
>> + * return false regardless, so invalid results from the zero check don't
>> + * matter.
>> + */
>> + if (zero_cum != UINT64CONST(0x8080808080808080))
>> + return false;
>
>I don't understand the "the compiler may perform these checks in any order" comment. We trust the compiler to do the right thing, and only reorder things when it's safe to do so. What is special here, why is it worth mentioning here?
>
>> @@ -1721,7 +1777,7 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
>> return s - start;
>> }
>>
>> -static int
>> +static pg_noinline int
>> pg_utf8_verifychar(const unsigned char *s, int len) {
>> int l;
>
>Why force it to not be inlined?
>
>> + * In a shift-based DFA, the input byte is an index into array of
>> + integers
>> + * whose bit pattern encodes the state transitions. To compute the
>> + current
>> + * state, we simply right-shift the integer by the current state and
>> + apply a
>> + * mask. In this scheme, the address of the transition only depends
>> + on the
>> + * input byte, so there is better pipelining.
>
>Should be "To compute the *next* state, ...", I think.
>
>The way the state transition table works is pretty inscrutable. That's understandable, because the values were found by an SMT solver, so I'm not sure if anything can be done about it.
>
>- Heikki
>

If I remember correctly the shift instruction is very fast...

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2021-12-10 19:03:17 Re: O(n) tasks cause lengthy startups and checkpoints
Previous Message Heikki Linnakangas 2021-12-10 18:33:48 Re: speed up verifying UTF-8