From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improve the performance of Unicode Normalization Forms. |
Date: | 2025-08-08 23:17:49 |
Message-ID: | adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com |
Lists: | pgsql-hackers |
On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
> Version 3 patches. In version 2 "make -s headerscheck" did not work.
I ran my own performance tests. What I did was get some test data from
ICU v76.1 by doing:
  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfc > ~/strings.nfc.txt
  cat icu4j/perf-tests/data/collation/Test* \
    | uconv -f utf-8 -t utf-8 -x nfd > ~/strings.nfd.txt
  export NORM_PERF_NFC_FILE=~/strings.nfc.txt
  export NORM_PERF_NFD_FILE=~/strings.nfd.txt
The first is about 8MB, the second 9MB (because NFD is slightly
larger).
Then I added some testing code to norm_test.c. It's not intended for
committing, just to run the test. Note that it requires setting
environment variables to find the input files.
If patch v3j-0001 is applied, it's using perfect hashing. If patches
v3j-0002 through v3j-0004 are applied, it's using your code. In either
case it compares with ICU.
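The shape of such a timing loop can be sketched independently of the attached patch. The following is a minimal Python sketch, not the code added to norm_test.c: it uses Python's standard `unicodedata` module as a stand-in normalizer, so its timings are not comparable with the PG or ICU figures. The helper names `time_normalization` and `run_from_env` are hypothetical; only the environment variable names come from the commands above.

```python
import os
import time
import unicodedata


def time_normalization(text, form, iterations=100):
    """Return total seconds spent normalizing `text` to `form` `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        unicodedata.normalize(form, text)
    return time.perf_counter() - start


def run_from_env(iterations=100):
    """Time each source/target pairing for whichever input files are configured."""
    # Same pairings as the test: NFC input -> NFD/NFKD, NFD input -> NFC/NFKC.
    plan = [
        ("NORM_PERF_NFC_FILE", "NFC", ("NFD", "NFKD")),
        ("NORM_PERF_NFD_FILE", "NFD", ("NFC", "NFKC")),
    ]
    results = {}
    for env_name, source_form, targets in plan:
        path = os.environ.get(env_name)
        if not path or not os.path.exists(path):
            continue  # input file not configured; skip this pairing
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for target in targets:
            results[(source_form, target)] = time_normalization(
                text, target, iterations
            )
    return results


if __name__ == "__main__":
    for (src, dst), seconds in run_from_env().items():
        print(f"Normalization from {src} to {dst}: {seconds:07.3f}")
```

Reading the whole file once and timing only the repeated normalization calls keeps file I/O out of the measurement.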
Results with perfect hashing (100 iterations):
Normalization from NFC to NFD with PG: 010.009
Normalization from NFC to NFD with ICU: 001.580
Normalization from NFC to NFKD with PG: 009.376
Normalization from NFC to NFKD with ICU: 000.857
Normalization from NFD to NFC with PG: 016.026
Normalization from NFD to NFC with ICU: 001.205
Normalization from NFD to NFKC with PG: 015.903
Normalization from NFD to NFKC with ICU: 000.654
Results with your code (100 iterations):
Normalization from NFC to NFD with PG: 004.626
Normalization from NFC to NFD with ICU: 001.577
Normalization from NFC to NFKD with PG: 004.024
Normalization from NFC to NFKD with ICU: 000.861
Normalization from NFD to NFC with PG: 006.846
Normalization from NFD to NFC with ICU: 001.209
Normalization from NFD to NFKC with PG: 006.655
Normalization from NFD to NFKC with ICU: 000.651
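To put the two result sets side by side, the ratios can be computed directly from the figures above. This is a small Python sketch of the arithmetic only; the dictionary layout and names are illustrative, and the numbers are copied from the two result sets.

```python
# Timings copied from the two result sets above, keyed by conversion.
perfect_hash_pg = {"NFC->NFD": 10.009, "NFC->NFKD": 9.376,
                   "NFD->NFC": 16.026, "NFD->NFKC": 15.903}
patched_pg = {"NFC->NFD": 4.626, "NFC->NFKD": 4.024,
              "NFD->NFC": 6.846, "NFD->NFKC": 6.655}
# ICU timings taken from the run with the patches applied.
icu = {"NFC->NFD": 1.577, "NFC->NFKD": 0.861,
       "NFD->NFC": 1.209, "NFD->NFKC": 0.651}


def ratios(numerator, denominator):
    """Divide matching entries of two timing tables."""
    return {key: numerator[key] / denominator[key] for key in numerator}


speedup_from_patches = ratios(perfect_hash_pg, patched_pg)
remaining_gap_to_icu = ratios(patched_pg, icu)

for key in patched_pg:
    print(f"{key}: patched PG is {speedup_from_patches[key]:.1f}x faster than "
          f"perfect hashing; ICU is {remaining_gap_to_icu[key]:.1f}x faster "
          f"than patched PG")
```

On these figures the patched code is roughly 2.2x to 2.4x faster than perfect hashing, while ICU remains roughly 2.9x to 10.2x faster than the patched code, with the largest gap on NFD to NFKC.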
Your patches are a major improvement, but I'm trying to figure out why
ICU still wins by so much. Thoughts? I haven't investigated much myself
yet, so it's quite possible there's a bug in my test.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v3j-0001-Performance-testing-infrastructure-for-normaliza.patch | text/x-patch | 8.6 KB |
v3j-0002-Undo-perfect-hash-changes.patch | text/x-patch | 2.2 KB |
v3j-0003-Moving-Perl-functions-Range-index-to-a-common-mo.patch | text/x-patch | 12.4 KB |
v3j-0004-Improve-the-performance-of-Unicode-Normalization.patch | text/x-patch | 1.0 MB |