From: | Victor Yegorov <vyegorov(at)gmail(dot)com> |
---|---|
To: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
Cc: | Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improve the performance of Unicode Normalization Forms. |
Date: | 2025-09-10 18:50:12 |
Message-ID: | CAGnEbohehx6sty5LFBkXqYKs7sB1qpy2YV1yu=n9X63ereosjQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, 3 Sept 2025 at 09:35, Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> wrote:
> Hi, Jeff, hackers!
>
> As promised, refactoring the C code for Unicode Normalization Forms.
>
> In general terms, here's what has changed:
> 1. Recursion has been removed; now data is generated using
> a Perl script.
> 2. Memory is no longer allocated for uint32 for the entire size,
> but uint8 is allocated for the entire size for the CCC cache, which
> boosts performance significantly.
> 3. The code for the unicode_normalize() function has been completely
> rewritten.
>
> I am confident that we have achieved excellent results.
>
Hey.
I've looked into these patches.
Patches apply, compilation succeeds, and make check and make installcheck
show no errors.
Code quality is good, although I suggest that a native English speaker
review the comments and commit messages, as they are a bit difficult to
follow.
The Sparse Array approach is described in the newly introduced
GenerateSparseArray.pm module. Perhaps it would be valuable to add a section
to src/common/unicode/README, where it would get more visibility.
(Not insisting here.)
For performance testing I used the approach by Jeff Davis [1].
I prepared NFC and NFD files, loaded them into UNLOGGED tables, and
measured normalize() calls.
CREATE UNLOGGED TABLE strings_nfd (
    str text STORAGE PLAIN NOT NULL
);
COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';
CREATE UNLOGGED TABLE strings_nfc (
    str text STORAGE PLAIN NOT NULL
);
COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';
SELECT count( normalize( str, NFD ) )
  FROM strings_nfd, generate_series( 1, 10 ) x;
SELECT count( normalize( str, NFC ) )
  FROM strings_nfc, generate_series( 1, 10 ) x;
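For anyone who wants to repeat this, input files of this kind could be
produced roughly as follows. This is only a minimal sketch in Python, not
necessarily how the files in [1] were generated; the helper names and the
source file strings.txt are hypothetical, and only the standard
unicodedata.normalize() call is relied upon.

```python
import unicodedata


def normalize_lines(lines, form):
    """Return each string normalized to the given form ("NFC" or "NFD")."""
    return [unicodedata.normalize(form, line) for line in lines]


def write_normalized(src_path, dst_path, form):
    """Copy a UTF-8 file line by line, normalizing every line to `form`.

    src_path and dst_path are illustrative; one string per line is assumed.
    """
    with open(src_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src]
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in normalize_lines(lines, form):
            dst.write(line + "\n")
```

With a hypothetical strings.txt at hand, calling
write_normalized("strings.txt", "strings.nfd.txt", "NFD") and the same with
"NFC" would yield two files that can be loaded with COPY as above.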
And I got the following numbers:
Master
NFD Time: 2954.630 ms / 295 ms
NFC Time: 3929.939 ms / 330 ms
Patched
NFD Time: 1658.345 ms / 166 ms / +78%
NFC Time: 1862.757 ms / 186 ms / +77%
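The percentages correspond to the second figure on each line (the
per-iteration time). A small sketch of that arithmetic, assuming the gain is
computed as master time divided by patched time, minus one; the function
name is made up for illustration:

```python
def speedup_percent(before_ms, after_ms):
    """Percentage gain when runtime drops from before_ms to after_ms."""
    return (before_ms / after_ms - 1.0) * 100.0


# Per-iteration timings quoted above, in milliseconds.
nfd_gain = speedup_percent(295, 166)
nfc_gain = speedup_percent(330, 186)

print(f"NFD: +{nfd_gain:.0f}%")  # prints "NFD: +78%"
print(f"NFC: +{nfc_gain:.0f}%")  # prints "NFC: +77%"
```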
Overall, I find these patches and the performance gain very valuable.
I've added myself as a reviewer and marked this patch as Ready for Committer.
[1] https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com
--
Victor Yegorov