Re: Improve the performance of Unicode Normalization Forms.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve the performance of Unicode Normalization Forms.
Date: 2025-09-02 21:07:00
Message-ID: 7859e5ef-a574-4199-a69b-6fee26711521@gmail.com
Lists: pgsql-hackers

Hi, Jeff, hackers!

As promised, here is the refactoring of the C code for Unicode Normalization Forms.

In general terms, here's what has changed:
1. Recursion has been removed; the data is now generated by
a Perl script.
2. We no longer allocate a uint32 array for the entire size; instead,
a uint8 array of the entire size is allocated for the CCC cache, which
boosts performance significantly.
3. The code for the unicode_normalize() function has been completely
rewritten.
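
To illustrate point 2, here is a minimal sketch of a one-byte-per-code-point CCC cache used for canonical reordering. It is not the code from the patch: toy_ccc(), fill_ccc_cache() and canonical_reorder() are hypothetical names, and the lookup covers only three combining marks. The point is only that canonical combining class values fit in a single byte, so the cache can be a uint8 array parallel to the code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CCC lookup: only three real combining marks are covered. */
static uint8_t
toy_ccc(uint32_t cp)
{
	switch (cp)
	{
		case 0x0301: return 230;	/* COMBINING ACUTE ACCENT */
		case 0x0323: return 220;	/* COMBINING DOT BELOW */
		case 0x0327: return 202;	/* COMBINING CEDILLA */
		default:     return 0;		/* treated as a starter in this sketch */
	}
}

/* Look up each code point's CCC once and keep it in one byte. */
static void
fill_ccc_cache(const uint32_t *cps, size_t n, uint8_t *ccc)
{
	for (size_t i = 0; i < n; i++)
		ccc[i] = toy_ccc(cps[i]);
}

/*
 * Canonical ordering: a stable insertion sort of non-starters by their
 * cached CCC. Starters (CCC 0) never move and block reordering across them.
 */
static void
canonical_reorder(uint32_t *cps, uint8_t *ccc, size_t n)
{
	assert(cps != NULL && ccc != NULL);

	for (size_t i = 1; i < n; i++)
	{
		uint32_t	cp = cps[i];
		uint8_t		c = ccc[i];
		size_t		j = i;

		if (c == 0)
			continue;

		/* Strict '>' keeps equal classes in their original order. */
		while (j > 0 && ccc[j - 1] > c)
		{
			cps[j] = cps[j - 1];
			ccc[j] = ccc[j - 1];
			j--;
		}
		cps[j] = cp;
		ccc[j] = c;
	}
}
```

For the input U+0061 U+0301 U+0323 U+0062, the two marks swap (220 sorts before 230) and the starters stay in place. The sort only ever reads the byte array, never the lookup table again.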

I am confident that we have achieved excellent results.

Jeff's test:
Without patch:
Normalization from NFC to NFD with PG: 009.121
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFD to NFC with PG: 014.525
Normalization from NFD to NFKC with PG: 014.380

With patch:
Normalization from NFC to NFD with PG: 001.580
Normalization from NFC to NFKD with PG: 001.634
Normalization from NFD to NFC with PG: 002.979
Normalization from NFD to NFKC with PG: 003.050

Test with ICU (with patch and ICU):
Normalization from NFC to NFD with PG: 001.580
Normalization from NFC to NFD with ICU: 001.880
Normalization from NFC to NFKD with PG: 001.634
Normalization from NFC to NFKD with ICU: 001.857

Normalization from NFD to NFC with PG: 002.979
Normalization from NFD to NFC with ICU: 001.144
Normalization from NFD to NFKC with PG: 003.050
Normalization from NFD to NFKC with ICU: 001.260

pgbench:
The test files were run through pgbench. They contain all code points
that need to be normalized.

NFC:
Patch: tps = 9701.568161
Without: tps = 6820.828104

NFD:
Patch: tps = 2707.155148
Without: tps = 1745.949174

NFKC:
Patch: tps = 9893.952804
Without: tps = 6697.358888

NFKD:
Patch: tps = 2580.785909
Without: tps = 1521.058417

To ensure fairness in testing with ICU, I corrected Jeff's patch:
we calculate the size of the final buffer before normalizing, so I made
ICU do the same.

I'm talking about:
Get size:
length = unorm_normalize(u_input, -1, form, 0, NULL, 0, &status);
Normalize:
unorm_normalize(u_input, -1, form, 0, u_result, length, &status);

Otherwise, it turned out that we were handing ICU some huge preallocated
buffer and it simply wrote into it, while we ourselves calculate the
buffer size we need.
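
The "measure, then fill" pattern described above can be sketched in plain C. This is an illustration under stated assumptions, not the benchmark code: toy_decompose() and decompose_exact() are hypothetical names, and the decomposer knows only two real canonical mappings (U+00E9 and U+00C5). With ICU itself, the sizing call is expected to report U_BUFFER_OVERFLOW_ERROR in status, so status would need to be reset before the second call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical toy decomposer (neither PostgreSQL's nor ICU's API).
 * It always returns the number of code points the full output needs and
 * writes only while there is room, so dst may be NULL when cap is 0.
 */
static size_t
toy_decompose(const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
	size_t		need = 0;

	for (size_t i = 0; i < n; i++)
	{
		uint32_t	out[2] = {0, 0};
		size_t		k;

		switch (src[i])
		{
			case 0x00E9:		/* e with acute -> e + combining acute */
				out[0] = 0x0065; out[1] = 0x0301; k = 2;
				break;
			case 0x00C5:		/* A with ring -> A + combining ring */
				out[0] = 0x0041; out[1] = 0x030A; k = 2;
				break;
			default:			/* everything else is copied unchanged */
				out[0] = src[i]; k = 1;
				break;
		}

		for (size_t j = 0; j < k; j++)
		{
			if (need < cap)
				dst[need] = out[j];
			need++;
		}
	}
	return need;
}

/* Pass 1 measures the output, pass 2 fills an exactly sized buffer. */
static uint32_t *
decompose_exact(const uint32_t *src, size_t n, size_t *out_len)
{
	size_t		need = toy_decompose(src, n, NULL, 0);
	uint32_t   *buf = malloc((need ? need : 1) * sizeof(uint32_t));

	if (buf == NULL)
		return NULL;

	size_t		written = toy_decompose(src, n, buf, need);

	assert(written == need);
	*out_len = need;
	return buf;				/* caller frees */
}
```

Both sides of a comparison then pay for the sizing pass, which is the fairness point: a caller that skips pass 1 and writes into an oversized buffer does roughly half the work.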

--
Regards,
Alexander Borisov

Attachment Content-Type Size
v4-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch text/plain 12.6 KB
v4-0002-Improve-the-performance-of-Unicode-Normalization-.patch text/plain 1.0 MB
v4-0003-Refactoring-Unicode-Normalization-Forms-performan.patch text/plain 1.3 MB
