Re: speed up verifying UTF-8

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-30 11:18:32
Message-ID: 2f95e70d-4623-87d4-9f24-ca534155f179@iki.fi
Lists: pgsql-hackers

On 29/06/2021 14:20, John Naylor wrote:
> I still wasn't quite happy with the churn in the regression tests, so
> for v13 I gave up on using both the existing utf8 table and my new one
> for the "padded input" tests, and instead just copied the NUL byte test
> into the new table. Also added a primary key to make sure the padded
> test won't give weird results if a new entry has a duplicate description.
>
> I came up with "highbit_carry" as a more descriptive variable name than
> "x", but that doesn't matter a whole lot.
>
> It also occurred to me that if we're going to check one 8-byte chunk at
> a time (like v12 does), maybe it's only worth it to load 8 bytes at a
> time. An earlier version did this, but without the recent tweaks. The
> worst-case scenario now might be different from the one with 16-bytes,
> but for now just tested the previous worst case (mixed2).

I tested the new worst case scenario on my laptop:

gcc master:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1311 |   758 |   405 |     583 |    725

gcc v13:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     956 |   472 |   160 |     572 |    939

mixed16 is the same as "mixed2" in the previous rounds, with
'123456789012345ä' as the repeating string. mixed8 uses '1234567ä',
which I believe is the worst case for patch v13. So v13 is somewhat
slower than master in the worst case (939 vs. 725 on mixed8).
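
To illustrate why '1234567ä' is a plausible worst case for an 8-byte ASCII check, here is a small sketch. It is not taken from any version of the patch, and the function name count_ascii_windows is made up for this example. It assumes the fast path is an "are the next 8 bytes all ASCII?" test, and counts the starting offsets within the repeating string for which that test would succeed. 'ä' is two bytes in UTF-8 (0xC3 0xA4).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: treat "unit" as an endlessly repeating string and
 * count how many starting offsets (one per byte of the unit) begin a
 * run of "window" bytes that are all ASCII, i.e. have the high bit clear.
 */
int
count_ascii_windows(const char *unit, size_t window)
{
    size_t      period = strlen(unit);
    int         count = 0;

    for (size_t start = 0; start < period; start++)
    {
        bool        all_ascii = true;

        for (size_t i = 0; i < window; i++)
        {
            /* wrap around to model the repeating input */
            unsigned char c = (unsigned char) unit[(start + i) % period];

            if (c & 0x80)
            {
                all_ascii = false;
                break;
            }
        }
        if (all_ascii)
            count++;
    }
    return count;
}
```

With window = 8, count_ascii_windows("123456789012345\xc3\xa4", 8) gives 8, while count_ascii_windows("1234567\xc3\xa4", 8) gives 0. In the mixed8 string every 8-byte window contains a high-bit byte, so under the stated assumption the ASCII test would be paid for each time and never succeed.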

Hmm, there's one more simple trick we can do: we can have a separate
fast-path version of the loop for when at least 8 bytes of input are
left, skipping all the length checks. With that:

gcc v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     737 |   412 |    94 |     476 |    725

All the above numbers were with gcc 10.2.1. For completeness, with clang
11.0.1-2 I got:

clang master:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1044 |   724 |   403 |     930 |    603
(1 row)

clang v13:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     596 |   445 |    79 |     417 |    715
(1 row)

clang v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     600 |   337 |    93 |     318 |    511

Attached is patch v14 with that optimization. It needs some cleanup; I
just hacked it up quickly for performance testing.
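
For readers without the attachment at hand, the general shape of such a fast path can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the code in the v14 patch: the names sketch_utf8_verify, utf8_expected_len and utf8_seq_len are invented here, and the sketch does not treat NUL bytes specially. It returns the number of leading bytes that form valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IS_CONT(b) (((b) & 0xC0) == 0x80)

/* Sequence length implied by the lead byte, or 0 if it cannot start one. */
static int
utf8_expected_len(unsigned char b0)
{
    if (b0 < 0x80) return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF) return 2;
    if (b0 >= 0xE0 && b0 <= 0xEF) return 3;
    if (b0 >= 0xF0 && b0 <= 0xF4) return 4;
    return 0;
}

/*
 * Validate one sequence at s WITHOUT any length check; the caller must
 * guarantee that the bytes implied by the lead byte are readable.
 * Returns the sequence length, or 0 if it is invalid.
 */
static int
utf8_seq_len(const unsigned char *s)
{
    switch (utf8_expected_len(s[0]))
    {
        case 1:
            return 1;
        case 2:
            return IS_CONT(s[1]) ? 2 : 0;
        case 3:
            if (!IS_CONT(s[1]) || !IS_CONT(s[2])) return 0;
            if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong */
            if (s[0] == 0xED && s[1] > 0x9F) return 0;  /* surrogate */
            return 3;
        case 4:
            if (!IS_CONT(s[1]) || !IS_CONT(s[2]) || !IS_CONT(s[3])) return 0;
            if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong */
            if (s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* > U+10FFFF */
            return 4;
        default:
            return 0;
    }
}

size_t
sketch_utf8_verify(const unsigned char *s, size_t len)
{
    const unsigned char *start = s;

    /* Fast path: at least 8 bytes remain, so no read here can overrun. */
    while (len >= 8)
    {
        uint64_t    chunk;
        int         l;

        memcpy(&chunk, s, sizeof(chunk));
        if ((chunk & UINT64_C(0x8080808080808080)) == 0)
        {
            s += 8;             /* eight ASCII bytes at once */
            len -= 8;
            continue;
        }
        l = utf8_seq_len(s);    /* at most 4 bytes, always in bounds */
        if (l == 0)
            return (size_t) (s - start);
        s += l;
        len -= (size_t) l;
    }

    /* Slow path: fewer than 8 bytes remain, so check the length first. */
    while (len > 0)
    {
        int         need = utf8_expected_len(s[0]);

        if (need == 0 || (size_t) need > len || utf8_seq_len(s) == 0)
            break;
        s += need;
        len -= (size_t) need;
    }
    return (size_t) (s - start);
}
```

The point of the split is that a UTF-8 sequence is at most 4 bytes long, so while 8 or more bytes remain, neither the 8-byte load nor the per-character check needs to compare against the remaining length. Only the tail loop pays for that comparison.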

- Heikki

Attachment: v14-0001-Rewrite-pg_utf8_verifystr-for-speed.patch (text/x-patch, 13.0 KB)
