Re: Improve the performance of Unicode Normalization Forms.

From:	Victor Yegorov <vyegorov(at)gmail(dot)com>
To:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-09-10 18:50:12
Message-ID:	CAGnEbohehx6sty5LFBkXqYKs7sB1qpy2YV1yu=n9X63ereosjQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Sep 3, 2025 at 09:35, Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> wrote:

> Hi, Jeff, hackers!
>
> As promised, refactoring the C code for Unicode Normalization Forms.
>
> In general terms, here's what has changed:
> 1. Recursion has been removed; now data is generated using
> a Perl script.
> 2. Memory is no longer allocated as uint32 for the entire size;
> instead, uint8 is allocated for the entire size for the CCC cache,
> which boosts performance significantly.
> 3. The code for the unicode_normalize() function has been completely
> rewritten.
>
> I am confident that we have achieved excellent results.
>

Hey.

I've looked into these patches.

Patches apply, compilation succeeds, make check and make installcheck show
no errors.

Code quality is good, although I suggest that a native English speaker
review the comments and commit messages, as they are a bit difficult to
follow.

The Sparse Array approach is described in the newly introduced
GenerateSparseArray.pm module. Perhaps it'd be valuable to add a section
to src/common/unicode/README, where it'll get more visibility.
( Not insisting here. )

For performance testing I've used an approach by Jeff Davis. [1]
I've prepared NFC and NFD files, loaded them into UNLOGGED tables and
measured normalize() calls.
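
A minimal sketch of how such input files might be prepared, using Python's
standard unicodedata module. This is an illustration only and not necessarily
how the files used here were produced: the sample strings, the
write_normalized helper and the output directory are all assumptions.

```python
import os
import tempfile
import unicodedata


def write_normalized(strings, form, path):
    """Write each string, normalized to `form`, as one line of a UTF-8 file."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for s in strings:
            out.write(unicodedata.normalize(form, s) + "\n")


# Hypothetical sample data; a real benchmark would use a much larger corpus.
# Strings must not contain tabs, newlines, or backslashes, which are special
# in COPY's default text format.
samples = ["Ame\u0301lie", "\u00c5ngstr\u00f6m", "Vi\u1ec7t Nam"]

# Written to a temporary directory here; the location is an assumption.
out_dir = tempfile.mkdtemp()
nfd_path = os.path.join(out_dir, "strings.nfd.txt")
nfc_path = os.path.join(out_dir, "strings.nfc.txt")
write_normalized(samples, "NFD", nfd_path)
write_normalized(samples, "NFC", nfc_path)
```

Each file holds one already-normalized string per line, so normalize() on the
matching table measures the case where the input is already in the target form.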

CREATE UNLOGGED TABLE strings_nfd (
str text STORAGE PLAIN NOT NULL
);
COPY strings_nfd FROM '/var/lib/postgresql/strings.nfd.txt';

CREATE UNLOGGED TABLE strings_nfc (
str text STORAGE PLAIN NOT NULL
);
COPY strings_nfc FROM '/var/lib/postgresql/strings.nfc.txt';

SELECT count( normalize( str, NFD ) ) FROM strings_nfd,
generate_series( 1, 10 ) x;
SELECT count( normalize( str, NFC ) ) FROM strings_nfc,
generate_series( 1, 10 ) x;

And I've got the following numbers (total for 10 passes / per pass /
speedup relative to master):

Master
NFD Time: 2954.630 ms / 295ms
NFC Time: 3299.939 ms / 330ms

Patched
NFD Time: 1658.345 ms / 166ms / +78%
NFC Time: 1862.757 ms / 186ms / +77%
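
The percentages are consistent with a throughput gain computed from the
per-pass timings, i.e. master / patched - 1. The message does not state the
formula, so treat this as an assumption; the speedup_percent helper below is
hypothetical and only shows the arithmetic.

```python
def speedup_percent(master_ms, patched_ms):
    """Return the throughput gain of patched over master, in percent."""
    return (master_ms / patched_ms - 1.0) * 100.0


# Per-pass timings in milliseconds, as reported above.
nfd_gain = speedup_percent(295, 166)
nfc_gain = speedup_percent(330, 186)

print(f"NFD: +{nfd_gain:.0f}%")  # NFD: +78%
print(f"NFC: +{nfc_gain:.0f}%")  # NFC: +77%
```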

Overall, I find these patches and performance very nice and valuable.
I've added myself as a reviewer and marked this patch as Ready for Committer.

[1] https://postgr.es/m/adffa1fbdb867d5a11c9a8211cde3bdb1e208823.camel@j-davis.com

--
Victor Yegorov
