Re: Improve the performance of Unicode Normalization Forms.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve the performance of Unicode Normalization Forms.
Date: 2025-06-11 12:27:02
Message-ID: cfd504f7-1fc1-43df-9356-f68818f30921@gmail.com
Lists: pgsql-hackers

11.06.2025 10:13, John Naylor wrote:
> On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> wrote:
>> 5. The server part "lost weight" in the binary, but the frontend
>> "gained weight" a little.
>>
>> I read the old commits, which say that the size of the frontend is very
>> important and that speed is not important
>> (speed is important on the server).
>> I'm not quite sure what to do if this is really the case. Perhaps
>> we should leave the slow version for the frontend.
>
> In the "small" patch, the frontend files got a few kB bigger, but the
> backend got quite a bit smaller. If we decided to go with this patch,
> I'd say it's preferable to do it in a way that keeps both paths the
> same.

Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.

>> How was it tested?
>> Four files were created for each normalization form: NFC, NFD, NFKC,
>> and NFKD.
>> The files were sent via pgbench. The files contain all code points that
>> need to be normalized.
>> Unfortunately, the patches are already quite large, but if necessary,
>> I can send these files in a separate email or upload them somewhere.
>
> What kind of workload do they present?
> Did you consider running the same tests from the thread that led to
> the current implementation?

I found the performance tests in this discussion:
https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
The results of running them are below.

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

1.

Normalize, decomp only

select count(normalize(t, NFD)) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 279,858 ms
Patch (small table): 282,925 ms
Without: 444,118 ms
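
For context on what this first workload exercises: md5() output is
hexadecimal ASCII, so NFD leaves every input string unchanged and the test
mostly measures per-character lookup overhead. A minimal sketch in Python
illustrates this. It uses the standard unicodedata module as a stand-in, not
PostgreSQL's implementation, and md5_hex is a hypothetical helper that mirrors
md5(i::text):

```python
import hashlib
import unicodedata


def md5_hex(i: int) -> str:
    # Hypothetical helper mirroring md5(i::text): hash the decimal text of i.
    return hashlib.md5(str(i).encode("ascii")).hexdigest()


# A smaller sample than the SQL test, enough to show the property.
samples = [md5_hex(i) for i in range(1, 1001)]

# Every sample is pure ASCII, so NFD returns it unchanged.
all_ascii = all(s.isascii() for s in samples)
unchanged = all(unicodedata.normalize("NFD", s) == s for s in samples)

print(all_ascii, unchanged)
```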

2.

select count(normalize(t, NFD)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 219,858 ms
Patch (small table): 247,893 ms
Without: 376,906 ms
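
Unlike the md5 input, this string actually changes under normalization. A
rough sketch of what happens to it, again using Python's unicodedata as a
stand-in (its Unicode data version may differ from the one PostgreSQL is
built with, so exact lengths are not asserted here):

```python
import unicodedata

# The same ten code points as in the SQL literal above.
base = "\u00E4\u00C5\u0958\u00F4\u1EBF\u3300\u1FE2\u3316\u2465\u322D"

nfd = unicodedata.normalize("NFD", base)    # canonical decomposition
nfkd = unicodedata.normalize("NFKD", base)  # compatibility decomposition

# Decomposition expands the string; compatibility forms expand it further.
print(len(base), len(nfd), len(nfkd))

# Single-character example: U+00E4 becomes 'a' followed by U+0308.
a_umlaut_nfd = unicodedata.normalize("NFD", "\u00E4")
print([hex(ord(c)) for c in a_umlaut_nfd])
```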

3.

Normalize, decomp+recomp

select count(normalize(t, NFC)) from (
select md5(i::text) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 7,553 ms
Patch (small table): 7,876 ms
Without: 13,177 ms

4.

select count(normalize(t, NFC)) from (
select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
generate_series(1,1000) as i
) s;

Patch (big table): 5,765 ms
Patch (small table): 6,782 ms
Without: 10,800 ms

5.

Quick check has not changed because these patches do not affect it:

-- all chars are quickcheck YES
select count(*) from (
select md5(i::text) as t from
generate_series(1,100000) as i
) s;

Patch (big table): 29,477 ms
Patch (small table): 29,436 ms
Without: 29,378 ms

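In the four normalization tests, the patched builds are roughly 1.5x to 1.9x
faster than the unpatched build, approaching 2x in the best case (test 4 with
the big table).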
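
The ratios behind that summary can be checked directly from the numbers
above. This is only a small arithmetic sketch; it assumes the comma in the
timings is a decimal separator, and the labels and the speedup helper are
made up for illustration:

```python
# Timings in ms copied from the results above, as (big table, small table, without).
# Assumption: the comma in the original figures is a decimal separator.
results = {
    "1 NFD md5": (279.858, 282.925, 444.118),
    "2 NFD repeat": (219.858, 247.893, 376.906),
    "3 NFC md5": (7.553, 7.876, 13.177),
    "4 NFC repeat": (5.765, 6.782, 10.800),
    "5 quick check": (29.477, 29.436, 29.378),
}


def speedup(without: float, patched: float) -> float:
    # Ratio of unpatched to patched time; above 1.0 means the patch is faster.
    return without / patched


for name, (big, small, without) in results.items():
    print(f"{name}: big {speedup(without, big):.2f}x, "
          f"small {speedup(without, small):.2f}x")
```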

--
Best regards,
Alexander Borisov
