Re: Improve the performance of Unicode Normalization Forms.

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve the performance of Unicode Normalization Forms.
Date: 2025-08-08 23:17:49
Message-ID: adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
> Version 3 patches. In version 2 "make -s headerscheck" did not work.

I ran my own performance tests. What I did was get some test data from
ICU v76.1 by doing:

cat icu4j/perf-tests/data/collation/Test* \
| uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt

cat icu4j/perf-tests/data/collation/Test* \
| uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt

export NORM_PERF_NFC_FILE=~/strings.nfc.txt
export NORM_PERF_NFD_FILE=~/strings.nfd.txt

The first is about 8MB, the second 9MB (because NFD is slightly
larger).

Then I added some testing code to norm_test.c. It's not intended for
committing, just to run the test. Note that it requires setting
environment variables to find the input files.
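As a rough illustration of that kind of harness (not the code in the attached patch), the sketch below reads a file named by an environment variable and times repeated passes over it. The names load_file, time_iterations, run_from_env and checksum_stand_in are hypothetical, the stand-in callback only sums bytes where a real test would call the PG or ICU normalization routine, and clock() measures CPU time, which may differ from how the attached test measures.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical callback type: stands in for a PG or ICU normalization call. */
typedef void (*norm_fn) (const char *input, size_t len);

static volatile unsigned long sink; /* keeps the stand-in from being optimized away */

/* Stand-in workload (an assumption): sums the input bytes, no normalization. */
static void
checksum_stand_in(const char *input, size_t len)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += (unsigned char) input[i];
    sink = sum;
}

/* Read a whole file into a malloc'd buffer; returns NULL on any failure. */
static char *
load_file(const char *path, size_t *len)
{
    FILE       *f = fopen(path, "rb");
    char       *buf;
    long        size;

    if (f == NULL)
        return NULL;
    if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0)
    {
        fclose(f);
        return NULL;
    }
    rewind(f);
    buf = malloc((size_t) size + 1);
    if (buf != NULL && fread(buf, 1, (size_t) size, f) != (size_t) size)
    {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf != NULL)
    {
        buf[size] = '\0';
        *len = (size_t) size;
    }
    return buf;
}

/* Call fn on the input `iterations` times; return elapsed CPU seconds. */
static double
time_iterations(norm_fn fn, const char *input, size_t len, int iterations)
{
    clock_t     start = clock();

    for (int i = 0; i < iterations; i++)
        fn(input, len);
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

/* Load the file named by env_name and time it; returns -1.0 if unavailable. */
static double
run_from_env(const char *env_name, norm_fn fn, int iterations)
{
    const char *path = getenv(env_name);
    size_t      len = 0;
    char       *input;
    double      elapsed;

    assert(iterations > 0);
    if (path == NULL)
        return -1.0;
    input = load_file(path, &len);
    if (input == NULL)
        return -1.0;
    elapsed = time_iterations(fn, input, len, iterations);
    free(input);
    return elapsed;
}
```

A caller would invoke something like run_from_env("NORM_PERF_NFC_FILE", checksum_stand_in, 100) from main() and print the returned value, treating a negative result as "variable unset or file unreadable".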

If patch v3j-0001 is applied, the test uses perfect hashing. If patches
v3j-0002 through v3j-0004 are applied, it uses your code. In either
case, it compares against ICU.

Results with perfect hashing (100 iterations):

Normalization from NFC to NFD with PG: 010.009
Normalization from NFC to NFD with ICU: 001.580
Normalization from NFC to NFKD with PG: 009.376
Normalization from NFC to NFKD with ICU: 000.857
Normalization from NFD to NFC with PG: 016.026
Normalization from NFD to NFC with ICU: 001.205
Normalization from NFD to NFKC with PG: 015.903
Normalization from NFD to NFKC with ICU: 000.654

Results with your code (100 iterations):

Normalization from NFC to NFD with PG: 004.626
Normalization from NFC to NFD with ICU: 001.577
Normalization from NFC to NFKD with PG: 004.024
Normalization from NFC to NFKD with ICU: 000.861
Normalization from NFD to NFC with PG: 006.846
Normalization from NFD to NFC with ICU: 001.209
Normalization from NFD to NFKC with PG: 006.655
Normalization from NFD to NFKC with ICU: 000.651

Your patches are a major improvement, but I'm trying to figure out why
ICU still wins by so much. Thoughts? I haven't investigated much myself
yet, so it's quite possible there's a bug in my test or something.
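
To put the two result sets in relative terms, the following sketch computes speedup ratios from the timings quoted above. The struct, array, and function names are hypothetical, and the ICU column uses the figures from the "your code" run.

```c
#include <stdio.h>

/* Timings copied from the results above (100 iterations each). */
struct timing
{
    const char *conversion;
    double      pg_perfect_hash;    /* PG with perfect hashing */
    double      pg_patched;         /* PG with the proposed patches */
    double      icu;                /* ICU, from the "your code" run */
};

static const struct timing timings[] = {
    {"NFC to NFD", 10.009, 4.626, 1.577},
    {"NFC to NFKD", 9.376, 4.024, 0.861},
    {"NFD to NFC", 16.026, 6.846, 1.209},
    {"NFD to NFKC", 15.903, 6.655, 0.651},
};

#define NUM_TIMINGS (sizeof(timings) / sizeof(timings[0]))

/* How many times faster `after` is than `before`. */
static double
speedup(double before, double after)
{
    return before / after;
}

/* Print one line per conversion with both ratios. */
static void
print_ratios(FILE *out)
{
    for (size_t i = 0; i < NUM_TIMINGS; i++)
        fprintf(out, "%-12s patched vs. perfect hash: %.2fx, ICU vs. patched: %.2fx\n",
                timings[i].conversion,
                speedup(timings[i].pg_perfect_hash, timings[i].pg_patched),
                speedup(timings[i].pg_patched, timings[i].icu));
}
```

Calling print_ratios(stdout) from main() prints the ratios. By these figures the patched code is a bit more than twice as fast as perfect hashing in every case, while ICU remains roughly 3 to 10 times faster than the patched code.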

Regards,
Jeff Davis

Attachment                                                        Content-Type  Size
v3j-0001-Performance-testing-infrastructure-for-normaliza.patch  text/x-patch  8.6 KB
v3j-0002-Undo-perfect-hash-changes.patch                          text/x-patch  2.2 KB
v3j-0003-Moving-Perl-functions-Range-index-to-a-common-mo.patch  text/x-patch  12.4 KB
v3j-0004-Improve-the-performance-of-Unicode-Normalization.patch  text/x-patch  1.0 MB
