From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: speed up unicode decomposition and recomposition
Date: 2020-10-14 16:58:21
Message-ID: CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
Lists: pgsql-hackers

Having committed the optimization for unicode normalization quick check,
Michael Paquier suggested I might do the same for decomposition as well. I
wrote:

> I'll
> do some performance testing soon. Note that a 25kB increase in size could
> be present in frontend binaries as well in this case. While looking at
> decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.

The attached patch uses a perfect hash for codepoint decomposition and, for
recomposition, reduces the linear search from 6604 entries to 942.

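To make the approach concrete, here is a minimal, self-contained sketch of
the two lookup strategies. The four-entry table and the hand-rolled
collision-free hash below are made up for illustration only; the patch
presumably generates the real hash the same way as the committed quick check
optimization, and the real table has 6604 entries, 942 of which are valid
for recomposition.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One decomposition entry: the codepoint is the hash key. */
typedef struct
{
	uint32_t	codepoint;		/* composed character */
	uint8_t		comb_class;		/* canonical combining class */
	uint32_t	decomp[2];		/* canonical decomposition (0 = unused slot) */
} decomp_entry;

/*
 * Entries usable for recomposition come first, so recomposition only scans
 * the first RECOMP_COUNT entries instead of the whole array.
 */
static const decomp_entry decomp_table[] = {
	{0x00C5, 0, {0x0041, 0x030A}},	/* A with ring above */
	{0x00E4, 0, {0x0061, 0x0308}},	/* a with diaeresis */
	/* entries below are excluded from recomposition */
	{0x0958, 0, {0x0915, 0x093C}},	/* composition exclusion */
	{0x0301, 230, {0, 0}},			/* combining acute: class only */
};

#define RECOMP_COUNT 2

/*
 * Toy stand-in for the generated perfect hash: a 16-slot index array that
 * happens to be collision-free for the four codepoints above (-1 = empty).
 */
static const int8_t hash_index[16] = {
	-1, 3, -1, -1, 1, 0, -1, -1, 2, -1, -1, -1, -1, -1, -1, -1
};

/* O(1) decomposition lookup: hash, then verify the key really matches. */
static const decomp_entry *
get_code_entry(uint32_t cp)
{
	int			idx = hash_index[cp % 16];

	if (idx >= 0 && decomp_table[idx].codepoint == cp)
		return &decomp_table[idx];
	return NULL;				/* not in table: class 0, no decomposition */
}

/* Recomposition: linear search limited to the recomposition-valid prefix. */
static bool
recompose_pair(uint32_t first, uint32_t second, uint32_t *result)
{
	for (int i = 0; i < RECOMP_COUNT; i++)
	{
		if (decomp_table[i].decomp[0] == first &&
			decomp_table[i].decomp[1] == second)
		{
			*result = decomp_table[i].codepoint;
			return true;
		}
	}
	return false;
}

int
main(void)
{
	const decomp_entry *e = get_code_entry(0x00E4);
	uint32_t	composed;

	if (e != NULL)
		printf("U+00E4 -> U+%04X U+%04X\n", e->decomp[0], e->decomp[1]);
	if (recompose_pair(0x0041, 0x030A, &composed))
		printf("U+0041 + U+030A -> U+%04X\n", composed);
	return 0;
}
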
The performance is very nice, and if I'd known better I would have done
this first, since the decomp array is as big as the two quick check arrays
put together:

Normalize, decomp only

select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

master patch
887ms 231ms

select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
  i % 3 + 1) as t from
generate_series(1,100000) as i
) s;

master patch
1110ms 208ms

Normalize, decomp+recomp (note: 100x less data)

select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;

master patch
194ms 50.6ms

select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D',
  i % 3 + 1) as t from
generate_series(1,1000) as i
) s;

master patch
137ms 39.4ms

Quick check is another 2x faster on top of previous gains, since it tests
canonical class via the decomposition array:

-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s
where t is NFC normalized;

master patch
296ms 131ms

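For reference, a rough sketch of why the quick check also benefits, reusing
get_code_entry() from the sketch above (names are illustrative, and only the
canonical-ordering half of the quick check is shown): every character's
combining class now comes from an O(1) hash lookup instead of a binary
search over the full decomposition array.

/* Canonical combining class comes from the same decomposition table. */
static uint8_t
get_canonical_class(uint32_t cp)
{
	const decomp_entry *entry = get_code_entry(cp);

	return entry ? entry->comb_class : 0;
}

/* Reject strings whose combining marks are not in canonical order. */
static bool
is_canonically_ordered(const uint32_t *str, int len)
{
	uint8_t		prev_class = 0;

	for (int i = 0; i < len; i++)
	{
		uint8_t		ccc = get_canonical_class(str[i]);

		if (ccc != 0 && ccc < prev_class)
			return false;		/* reordering needed, so not normalized */
		prev_class = ccc;
	}
	return true;
}
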
Some other considerations:
- As I alluded to above, this adds ~26kB to libpq because of SASLPrep. Since
the decomp array was reordered to optimize linear search, it can no longer
be used for binary search. It's possible to build two arrays, one for
frontend and one for backend, but that's additional complexity. We could
also force frontend to do a linear search all the time, but that seems
foolish. I haven't checked if it's possible to exclude the hash from
backend's libpq.
- I could split out the two approaches into separate patches, but it'd be
rather messy.

I'll add a CF entry for this.

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v1-0001-Optimize-unicode-decomposition-and-recomposition.patch application/octet-stream 186.1 KB
