Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-26 11:09:00
Message-ID: CAFBsxsHR08mHEf06PvrMRstfcyPJLwF69g0r1pvRrxWD4GEVoQ@mail.gmail.com
Lists: pgsql-hackers

Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to
produce an optimized loop, using branch-free accumulators. That way, it
doesn't need to be rewritten for different input lengths. I also think it's
a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.
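
To illustrate item 2, here is a minimal sketch of an ASCII check built on branch-free accumulators. It is not the code from the attached patch: the function name sketch_is_valid_ascii and the 8-byte chunk size are chosen for illustration, and it assumes that zero bytes must be rejected along with bytes that have the high bit set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch, not the patch's code: returns true if all `len`
 * bytes are nonzero 7-bit ASCII. Assumes `len` is a multiple of 8;
 * any trailing partial chunk is ignored.
 */
static bool
sketch_is_valid_ascii(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t    chunk;
    uint64_t    highbit_acc = 0;            /* ORs together every chunk seen */
    uint64_t    nonzero_acc = high_bits;    /* loses a bit if a byte is zero */

    while (len >= sizeof(chunk))
    {
        memcpy(&chunk, s, sizeof(chunk));   /* alignment-safe load */

        /*
         * Adding 0x7f to a byte in 0x01..0x7f sets its high bit without
         * carrying, while a zero byte becomes 0x7f and leaves it clear.
         * Bytes >= 0x80 can carry into their neighbor, but those inputs
         * are rejected through highbit_acc anyway.
         */
        nonzero_acc &= chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);
        highbit_acc |= chunk;

        s += sizeof(chunk);
        len -= sizeof(chunk);
    }

    /* The only branches are on the accumulated results, after the loop. */
    return (highbit_acc & high_bits) == 0 && nonzero_acc == high_bits;
}
```

Because the loop body has no data-dependent branches, the same function can be called with any length that is a multiple of the chunk size, and the compiler is free to unroll it.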

With #2 above in place, I wanted to try different strides for the DFA, so
here are more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance.
Other platforms may differ, but I don't expect it to make a huge amount of
difference.
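
For readers who want the comparison as ratios, the following sketch divides the HEAD figure by the patched figure for each column of the Power8 tables. The numbers are copied from the tables above; the array and function names are illustrative only, and the sketch assumes that lower numbers are better.

```c
#include <stddef.h>

/* Column order: chinese, mixed, ascii, mixed16, mixed8 (Power8, gcc 4.8). */
static const double head_power8[5]     = {2944, 1523, 871, 1473, 1509};
static const double stride8_power8[5]  = {1189, 550, 246, 600, 936};
static const double stride16_power8[5] = {981, 440, 134, 791, 820};
static const double stride32_power8[5] = {857, 481, 141, 834, 839};

/* A ratio above 1 means the patched build scored lower than HEAD. */
static double
speedup(double head, double patched)
{
    return head / patched;
}

/* Fill out[0..n-1] with the per-column ratio of head[] to patched[]. */
static void
speedup_row(const double *head, const double *patched, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = speedup(head[i], patched[i]);
}
```

For example, speedup(head_power8[2], stride16_power8[2]) is 871 / 134 = 6.5 for the ascii column with the 16-byte stride.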

Just for fun, I was also a bit curious about what Vladimir mentioned
upthread about x86-64-v3 offering a different shift instruction. Somehow,
clang 12 refused to build with that target, even though the release notes
say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2" :
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And, gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
53: c4 e2 eb f7 d1 shrxq %rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16
shifts in the loop, which gcc won't do. To be clear, this is irrelevant in
practice, since x86-64-v3 includes AVX2, and if we had that we would just
use it with the SIMD algorithm.
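
As background on why a variable shift shows up in the inner loop, here is a toy shift-based DFA. It is a simplified sketch and not the state machine from the patch: it checks only the lead-byte/continuation-byte structure of UTF-8 and does not reject overlong encodings, surrogates, or code points above U+10FFFF. All names (ST_*, TRANS, toy_utf8_structure_ok) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Each state is stored as a bit offset into a 64-bit transition row. */
enum { ST_ERR = 0, ST_END = 6, ST_CS1 = 12, ST_CS2 = 18, ST_CS3 = 24 };

/* Put next-state `to` in the 6-bit slot that is read in state `from`. */
#define TRANS(from, to) ((uint64_t) (to) << (from))

/* Transition row for one input byte (structure-only toy rules). */
static uint64_t
toy_row(unsigned char b)
{
    if (b < 0x80)                   /* ASCII: only valid between characters */
        return TRANS(ST_END, ST_END);
    if (b < 0xC0)                   /* continuation byte: one fewer pending */
        return TRANS(ST_CS1, ST_END) | TRANS(ST_CS2, ST_CS1) |
               TRANS(ST_CS3, ST_CS2);
    if (b >= 0xC2 && b <= 0xDF)     /* 2-byte lead */
        return TRANS(ST_END, ST_CS1);
    if (b >= 0xE0 && b <= 0xEF)     /* 3-byte lead */
        return TRANS(ST_END, ST_CS2);
    if (b >= 0xF0 && b <= 0xF4)     /* 4-byte lead */
        return TRANS(ST_END, ST_CS3);
    return 0;                       /* every slot is ST_ERR */
}

static bool
toy_utf8_structure_ok(const unsigned char *s, size_t len)
{
    static uint64_t table[256];
    static bool     table_ready = false;
    uint64_t        state = ST_END;

    if (!table_ready)               /* lazy, non-thread-safe initialization */
    {
        for (int i = 0; i < 256; i++)
            table[i] = toy_row((unsigned char) i);
        table_ready = true;
    }

    /*
     * One table load and one variable shift per byte. Only the low six
     * bits of `state` are meaningful; the bits above them are leftovers
     * from the shifted row and are masked off before the next shift.
     * ST_ERR is offset 0 and every row has zero in bits 0..5, so the
     * error state is sticky.
     */
    for (size_t i = 0; i < len; i++)
        state = table[s[i]] >> (state & 63);

    return (state & 63) == ST_END;
}
```

The loop-carried dependency is a single load plus shift per input byte, which is why the choice of shift instruction and whether the compiler unrolls the shifts can matter for throughput.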

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment: v20-0001-Add-a-fast-path-for-validating-UTF-8-text.patch (application/x-patch, 60.2 KB)
