Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-29 11:20:38
Message-ID: CAFBsxsH9xJpru2U6_ua963LV8LP34=bJRaESUTUS1mH6Y-m+_g@mail.gmail.com
Lists: pgsql-hackers

I still wasn't quite happy with the churn in the regression tests, so for
v13 I gave up on using both the existing utf8 table and my new one for the
"padded input" tests, and instead just copied the NUL byte test into the
new table. I also added a primary key to make sure the padded test won't give
weird results if a new entry has a duplicate description.

I came up with "highbit_carry" as a more descriptive variable name than
"x", but that doesn't matter a whole lot.

It also occurred to me that if we're going to check one 8-byte chunk at a
time (like v12 does), maybe it's only worth it to load 8 bytes at a time.
An earlier version did this, but without the recent tweaks. The worst-case
scenario now might be different from the one with 16 bytes, but for now I
have only tested the previous worst case (mixed2). I only tested on ppc64le,
since I'm hoping x86 will get the SIMD algorithm (I'm holding off rebasing
0002 until 0001 settles down).
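
For readers who have not seen the trick, here is a minimal sketch of checking
one 8-byte chunk at a time. It is not the code from the patch: the function
names are hypothetical, and only the variable name "highbit_carry" comes from
the message above. The assumption is that a chunk passes the fast path when
every byte is ASCII and none is NUL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): true if the 8 bytes at s are
 * all ASCII (high bit clear) and none of them is a NUL byte.
 */
static bool
chunk_is_ascii_nonzero(const unsigned char *s)
{
    uint64_t    chunk;
    uint64_t    highbit_carry;

    /* memcpy avoids unaligned-access problems; compilers emit a plain load */
    memcpy(&chunk, s, sizeof(chunk));

    /* Any byte >= 0x80 has its high bit set, so the chunk is not pure ASCII */
    if (chunk & UINT64_C(0x8080808080808080))
        return false;

    /*
     * Every byte is now in 0x00..0x7f, so adding 0x7f to each byte cannot
     * overflow into the neighboring byte. The addition carries into a byte's
     * high bit exactly when that byte is nonzero.
     */
    highbit_carry = chunk + UINT64_C(0x7f7f7f7f7f7f7f7f);

    /* All eight high bits set means there was no zero byte */
    return (highbit_carry & UINT64_C(0x8080808080808080))
        == UINT64_C(0x8080808080808080);
}

/*
 * Hypothetical driver: number of leading bytes of s that pass the check in
 * whole 8-byte chunks. A real verifier would hand the remainder to a
 * byte-at-a-time multibyte path.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t      done = 0;

    while (len - done >= 8 && chunk_is_ascii_nonzero(s + done))
        done += 8;

    return done;
}
```

With an 8-byte stride, a single non-ASCII byte sends at most 8 bytes to the
slow path rather than 16, which is one plausible reason the worst case could
shift between the two chunk sizes.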

Power8, Linux, gcc 4.8

master:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    2952 |  1520 |   871 |   1473

v11:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1015 |   641 |   102 |   1636

v12:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     964 |   629 |   168 |   1069

v13:
 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     954 |   643 |   202 |   1046

v13 is not much different from v12 in speed, but has the nice property of
simpler code. Neither is as fast as v11 for ascii, but neither regresses on
v11's worst case (mixed2), where v11 is slower than master. I'm leaning
towards v13 for the fallback.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v13-0001-Rewrite-pg_utf8_verifystr-for-speed.patch application/octet-stream 12.0 KB
