From: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improve the performance of Unicode Normalization Forms. |
Date: | 2025-09-02 21:07:00 |
Message-ID: | 7859e5ef-a574-4199-a69b-6fee26711521@gmail.com |
Lists: | pgsql-hackers |
Hi, Jeff, hackers!
As promised, here is the refactoring of the C code for Unicode
Normalization Forms.
In general terms, here's what has changed:
1. Recursion has been removed; the data is now generated by
a Perl script.
2. Memory is no longer allocated as uint32 for the entire size;
instead, a uint8 buffer of the same length is allocated for the CCC
(canonical combining class) cache, which boosts performance
significantly.
3. The code for the unicode_normalize() function has been completely
rewritten.
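To illustrate point 2, here is a minimal sketch of a one-byte-per-code-point
CCC cache driving the canonical reordering step. This is not the code from
the patch: toy_ccc(), fill_ccc_cache() and canonical_reorder() are
hypothetical names, and the lookup covers only three combining marks, where
real code would consult the generated Unicode tables.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical toy lookup of the canonical combining class (CCC).
 * Only three combining marks are known here; everything else is
 * treated as a starter (class 0).
 */
uint8_t
toy_ccc(uint32_t cp)
{
	switch (cp)
	{
		case 0x0301:			/* COMBINING ACUTE ACCENT */
			return 230;
		case 0x0323:			/* COMBINING DOT BELOW */
			return 220;
		case 0x0327:			/* COMBINING CEDILLA */
			return 202;
		default:
			return 0;
	}
}

/* Look up each class once and keep it in one byte per code point. */
void
fill_ccc_cache(const uint32_t *cps, size_t n, uint8_t *ccc)
{
	for (size_t i = 0; i < n; i++)
		ccc[i] = toy_ccc(cps[i]);
}

/*
 * Canonical ordering as a stable insertion sort: a non-starter moves
 * left past neighbours with a strictly higher class and never crosses
 * a starter. Only the cached classes are read; no table lookups.
 */
void
canonical_reorder(uint32_t *cps, uint8_t *ccc, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		uint32_t	cp = cps[i];
		uint8_t		cc = ccc[i];
		size_t		j = i;

		if (cc == 0)
			continue;			/* starters stay where they are */

		while (j > 0 && ccc[j - 1] > cc)
		{
			cps[j] = cps[j - 1];
			ccc[j] = ccc[j - 1];
			j--;
		}
		cps[j] = cp;
		ccc[j] = cc;
	}
}
```

A caller would fill the cache once for the decomposed buffer and then call
canonical_reorder() on the same two arrays; the cache costs one byte per
code point.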
I am confident that we have achieved excellent results.
Jeff's test:
Without patch:
Normalization from NFC to NFD with PG: 009.121
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFD to NFC with PG: 014.525
Normalization from NFD to NFKC with PG: 014.380
With patch:
Normalization from NFC to NFD with PG: 001.580
Normalization from NFC to NFKD with PG: 001.634
Normalization from NFD to NFC with PG: 002.979
Normalization from NFD to NFKC with PG: 003.050
Test with ICU (with patch and ICU):
Normalization from NFC to NFD with PG: 001.580
Normalization from NFC to NFD with ICU: 001.880
Normalization from NFC to NFKD with PG: 001.634
Normalization from NFC to NFKD with ICU: 001.857
Normalization from NFD to NFC with PG: 002.979
Normalization from NFD to NFC with ICU: 001.144
Normalization from NFD to NFKC with PG: 003.050
Normalization from NFD to NFKC with ICU: 001.260
pgbench:
The test files were run through pgbench. They contain all code points
that need to be normalized.
NFC:
Patch: tps = 9701.568161
Without: tps = 6820.828104
NFD:
Patch: tps = 2707.155148
Without: tps = 1745.949174
NFKC:
Patch: tps = 9893.952804
Without: tps = 6697.358888
NFKD:
Patch: tps = 2580.785909
Without: tps = 1521.058417
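For reading the figures above as ratios, a small sketch follows. The helper
names time_speedup(), tps_gain() and print_ratios() are made up for
illustration; the inputs are the numbers quoted in this message, and the
timing figures are assumed to be elapsed times (smaller is better).

```c
#include <stdio.h>

/* Elapsed-time figures: smaller is better, so speedup = before / after. */
double
time_speedup(double without_patch, double with_patch)
{
	return without_patch / with_patch;
}

/* Throughput figures (tps): larger is better, so gain = after / before. */
double
tps_gain(double without_patch, double with_patch)
{
	return with_patch / without_patch;
}

/* Print ratios for a few of the figures quoted in this message. */
void
print_ratios(void)
{
	printf("NFC->NFD time speedup: %.2fx\n", time_speedup(9.121, 1.580));
	printf("NFD->NFC time speedup: %.2fx\n", time_speedup(14.525, 2.979));
	printf("pgbench NFC tps gain:  %.2fx\n",
		   tps_gain(6820.828104, 9701.568161));
	printf("pgbench NFD tps gain:  %.2fx\n",
		   tps_gain(1745.949174, 2707.155148));
}
```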
To make the comparison with ICU fair, I corrected Jeff's patch: our
code calculates the size of the final buffer itself, so I put ICU in
the same position.
I'm talking about:
Get size:
length = unorm_normalize(u_input, -1, form, 0, NULL, 0, &status);
Normalize:
unorm_normalize(u_input, -1, form, 0, u_result, length, &status);
Otherwise, we were handing ICU a huge preallocated buffer to write
into, while our own code has to calculate the buffer size it needs.
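The same "size first, then fill" pattern can be sketched without ICU. In
the sketch below, toy_decompose() and decompose_exact() are hypothetical
names and the decomposition table has only two entries; it is meant to show
the two-pass shape, not the benchmark code. When doing this with ICU
itself, the sizing call normally reports U_BUFFER_OVERFLOW_ERROR, so status
usually has to be reset to U_ZERO_ERROR before the second call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Toy decomposition: returns the number of code points the result
 * needs and writes only while there is room in dst. Calling it with
 * dst = NULL and cap = 0 is therefore a pure sizing pass.
 */
size_t
toy_decompose(const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
	size_t		out = 0;

	for (size_t i = 0; i < n; i++)
	{
		uint32_t	a = src[i];
		uint32_t	b = 0;

		if (src[i] == 0x00E9)	/* e with acute -> e + combining acute */
		{
			a = 0x0065;
			b = 0x0301;
		}
		else if (src[i] == 0x00C5)	/* A with ring -> A + combining ring */
		{
			a = 0x0041;
			b = 0x030A;
		}

		if (out < cap)
			dst[out] = a;
		out++;
		if (b != 0)
		{
			if (out < cap)
				dst[out] = b;
			out++;
		}
	}
	return out;
}

/* Pass 1 computes the exact size, pass 2 fills an exactly sized buffer. */
uint32_t *
decompose_exact(const uint32_t *src, size_t n, size_t *out_len)
{
	size_t		need = toy_decompose(src, n, NULL, 0);
	uint32_t   *dst = malloc((need ? need : 1) * sizeof(uint32_t));

	if (dst == NULL)
		return NULL;
	*out_len = toy_decompose(src, n, dst, need);
	return dst;					/* caller frees */
}
```

Both sides of the benchmark then pay for the sizing pass, which is the
point of the correction described above.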
--
Regards,
Alexander Borisov
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch | text/plain | 12.6 KB |
v4-0002-Improve-the-performance-of-Unicode-Normalization-.patch | text/plain | 1.0 MB |
v4-0003-Refactoring-Unicode-Normalization-Forms-performan.patch | text/plain | 1.3 MB |