Quick Links

Re: Improve the performance of Unicode Normalization Forms.

From:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-08-11 14:21:04
Message-ID:	cffe5df6-b2de-4f72-ad22-023208409aa5@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

09.08.2025 02:17, Jeff Davis пишет:
> On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
>> Version 3 patches. In version 2 "make -s headerscheck" did not work.
>
> I ran my own performance tests. What I did was get some test data from
> ICU v76.1 by doing:
>

[..]

> Results with perfect hashing (100 iterations):
>
> Normalization from NFC to NFD with PG: 010.009
> Normalization from NFC to NFD with ICU: 001.580
> Normalization from NFC to NFKD with PG: 009.376
> Normalization from NFC to NFKD with ICU: 000.857
> Normalization from NFD to NFC with PG: 016.026
> Normalization from NFD to NFC with ICU: 001.205
> Normalization from NFD to NFKC with PG: 015.903
> Normalization from NFD to NFKC with ICU: 000.654
>
> Results with your code (100 iterations):
>
> Normalization from NFC to NFD with PG: 004.626
> Normalization from NFC to NFD with ICU: 001.577
> Normalization from NFC to NFKD with PG: 004.024
> Normalization from NFC to NFKD with ICU: 000.861
> Normalization from NFD to NFC with PG: 006.846
> Normalization from NFD to NFC with ICU: 001.209
> Normalization from NFD to NFKC with PG: 006.655
> Normalization from NFD to NFKC with ICU: 000.651
>
> Your patches are a major improvement, but I'm trying to figure out why
> ICU still wins by so much. Thoughts? I didn't investigate much myself
> yet, so it's quite possible there's a bug in my test or something.

I had some experimented a bit, and took a look at the ICU code.

1. The performance test is not entirely accurate.

The thing is that in ICU (unorm_normalize()), we pass the output
buffer and its size.
That is, ICU does not calculate how much data we will have at the
output; we pass it our buffer. ICU simply checks whether the data
fits into the buffer or not.

In our case, PG (unicode_normalize()) only accepts the input buffer
and first calculates the size of the output buffer.
This means we are doing double the work, as we have already gone
through the input data at least once too many times.

Here, it would be more honest to call the function for calculating
the buffer in ICU, and only then call data normalization.

2. In PG, unnecessary table lookups (get_code_entry()) that can
clearly be avoided.
3. In PG, the algorithm is not optimized.

We could eliminate the decompose_code() function, which is called
for each code point.
In this function, we can remove recursion and prepare the data in
advance. That is, we can perform data expansion in the Perl script.
If we do this, we will have code that is in place and without recursion.
And then there are all sorts of other minor details.

Of course, there are other approaches.

I did this comparison (your test, Jeff):

1. Without patch.

Normalization from NFC to NFD with PG: 009.121
Normalization from NFC to NFD with ICU: 000.954
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFC to NFKD with ICU: 000.965
Normalization from NFD to NFC with PG: 014.525
Normalization from NFD to NFC with ICU: 000.623
Normalization from NFD to NFKC with PG: 014.380
Normalization from NFD to NFKC with ICU: 000.666

2. With patch.

Normalization from NFC to NFD with PG: 005.743
Normalization from NFC to NFD with ICU: 001.005
Normalization from NFC to NFKD with PG: 005.902
Normalization from NFC to NFKD with ICU: 000.963
Normalization from NFD to NFC with PG: 007.837
Normalization from NFD to NFC with ICU: 000.640
Normalization from NFD to NFKC with PG: 008.036
Normalization from NFD to NFKC with ICU: 000.667

3. With a patch, but! With direct access to tables, i.e., I created one
large table (index table, unit16_t) where there is no search, direct
access to data.

Normalization from NFC to NFD with PG: 002.792
Normalization from NFC to NFD with ICU: 000.953
Normalization from NFC to NFKD with PG: 002.865
Normalization from NFC to NFKD with ICU: 000.958
Normalization from NFD to NFC with PG: 004.361
Normalization from NFD to NFC with ICU: 000.651
Normalization from NFD to NFKC with PG: 004.474
Normalization from NFD to NFKC with ICU: 000.668

It is evident that direct access provides x2 from the patch, but adds
+1.5MB to the object file size.
This is just to check the time difference in access.

As a result, I would not look into ICU at the moment, especially since
we have a different approach.
I am currently working on optimizing unicode_normalize().
I am trying to come up with an improved version of the algorithm in C
by the next commitfest (which will be in September).

P.S.: In general, I strive to surpass ICU in terms of speed.
I think we're making good progress. Let's see what happens.

--
Regards,
Alexander Borisov

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2025-08-08 23:17:49 from Jeff Davis

Responses

Re: Improve the performance of Unicode Normalization Forms. at 2025-08-13 22:12:17 from Jeff Davis

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Antonin Houska	2025-08-11 14:22:00	Re: Adding REPACK [concurrently]
Previous Message	Tomas Vondra	2025-08-11 14:16:05	Re: index prefetching