From: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improve the performance of Unicode Normalization Forms. |
Date: | 2025-06-11 12:27:02 |
Message-ID: | cfd504f7-1fc1-43df-9356-f68818f30921@gmail.com |
Lists: | pgsql-hackers |
11.06.2025 10:13, John Naylor wrote:
> On Tue, Jun 3, 2025 at 1:51 PM Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> wrote:
>> 5. The server part "lost weight" in the binary, but the frontend
>> "gained weight" a little.
>>
>> I read the old commits, which say that the size of the frontend is very
>> important and that speed is not important
>> (speed is important on the server).
>> I'm not quite sure what to do if this is really the case. Perhaps
>> we should leave the slow version for the frontend.
>
> In the "small" patch, the frontend files got a few kB bigger, but the
> backend got quite a bit smaller. If we decided to go with this patch,
> I'd say it's preferable to do it in a way that keeps both paths the
> same.
Okay, then I'll leave the frontend unchanged so that the size remains
the same. The changes will only affect the backend.
>> How was it tested?
>> Four files were created for each normalization form: NFC, NFD, NFKC,
>> and NFKD.
>> The files were sent via pgbench. The files contain all code points that
>> need to be normalized.
>> Unfortunately, the patches are already quite large, but if necessary,
>> I can send these files in a separate email or upload them somewhere.
>
> What kind of workload do they present?
> Did you consider running the same tests from the thread that led to
> the current implementation?
I found the performance tests in this discussion:
https://www.postgresql.org/message-id/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
Below are performance test results.
* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
1. Normalize, decomp only
select count(normalize(t, NFD)) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;
Patch (big table): 279.858 ms
Patch (small table): 282.925 ms
Without: 444.118 ms
2.
select count(normalize(t, NFD)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,100000) as i
) s;
Patch (big table): 219.858 ms
Patch (small table): 247.893 ms
Without: 376.906 ms
3. Normalize, decomp+recomp
select count(normalize(t, NFC)) from (
  select md5(i::text) as t from
  generate_series(1,1000) as i
) s;
Patch (big table): 7.553 ms
Patch (small table): 7.876 ms
Without: 13.177 ms
4.
select count(normalize(t, NFC)) from (
  select repeat(U&'\00E4\00C5\0958\00F4\1EBF\3300\1FE2\3316\2465\322D', i % 3 + 1) as t from
  generate_series(1,1000) as i
) s;
Patch (big table): 5.765 ms
Patch (small table): 6.782 ms
Without: 10.800 ms
5. Quick check has not changed because these patches do not affect it:
-- all chars are quickcheck YES
select count(*) from (
  select md5(i::text) as t from
  generate_series(1,100000) as i
) s;
Patch (big table): 29.477 ms
Patch (small table): 29.436 ms
Without: 29.378 ms
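For reference, the comment "all chars are quickcheck YES" holds because md5() returns lowercase hexadecimal text, which is pure ASCII, and ASCII text is already normalized in every form. Below is a minimal Python sketch of that property. It uses Python's unicodedata module as a stand-in, not PostgreSQL's implementation, and the helper names md5_hex and all_already_normalized are made up for illustration.

```python
import hashlib
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def md5_hex(i: int) -> str:
    """Hypothetical stand-in for md5(i::text): hex digest of the decimal text."""
    return hashlib.md5(str(i).encode("ascii")).hexdigest()


def all_already_normalized(n: int) -> bool:
    """True if every md5 string for 1..n is ASCII and normalized in all four forms."""
    for i in range(1, n + 1):
        t = md5_hex(i)
        if not t.isascii():
            return False
        # is_normalized() needs Python 3.8 or newer.
        if not all(unicodedata.is_normalized(form, t) for form in FORMS):
            return False
    return True


if __name__ == "__main__":
    print(all_already_normalized(1000))
```

Since no character in such input needs to be decomposed or recomposed, it is expected that a change to the decomposition tables leaves these timings unchanged.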
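From these tests, the patched builds are roughly 1.5x to 1.9x faster than the unpatched one in the normalization tests (1 to 4), approaching 2x in the best case (test 4, big table). The quick check timings are unchanged.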
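The speedup ratios can be recomputed from the timings above with a short script. This is only a sketch: the test labels ("md5", "repeat") are shorthand introduced here, and the comma in the reported timings is assumed to be a decimal separator.

```python
# Timings in milliseconds, copied from the results above.
# Assumption: the comma in the reported numbers is a decimal separator.
RESULTS = {
    "1. NFD, md5": {"big": 279.858, "small": 282.925, "without": 444.118},
    "2. NFD, repeat": {"big": 219.858, "small": 247.893, "without": 376.906},
    "3. NFC, md5": {"big": 7.553, "small": 7.876, "without": 13.177},
    "4. NFC, repeat": {"big": 5.765, "small": 6.782, "without": 10.800},
    "5. quick check": {"big": 29.477, "small": 29.436, "without": 29.378},
}


def speedup(without_ms: float, patched_ms: float) -> float:
    """How many times faster the patched build is than the unpatched one."""
    return without_ms / patched_ms


def speedups(results):
    """Map each test name to (big-table speedup, small-table speedup)."""
    return {
        name: (
            speedup(r["without"], r["big"]),
            speedup(r["without"], r["small"]),
        )
        for name, r in results.items()
    }


if __name__ == "__main__":
    for name, (big, small) in speedups(RESULTS).items():
        print(f"{name}: big table {big:.2f}x, small table {small:.2f}x")
```

A ratio above 1 means the patched build is faster. A ratio close to 1, as for the quick check, means no measurable difference.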
--
Best regards,
Alexander Borisov