Re: Improve the performance of Unicode Normalization Forms.

From:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-06-20 14:51:24
Message-ID:	16f87504-f174-450e-93cc-2db1074522bb@gmail.com
Lists:	pgsql-hackers

19.06.2025 20:41, Jeff Davis wrote:
> On Tue, 2025-06-03 at 00:51 +0300, Alexander Borisov wrote:
>> As promised, I continue to improve/speed up Unicode in Postgres.
>> Last time, we improved the lower(), upper(), and casefold()
>> functions. [1]
>> Now it's time for Unicode Normalization Forms, specifically
>> the normalize() function.
>
> Did you compare against other implementations, such as ICU's
> normalization functions? There's also a rust crate here:
>
> https://github.com/unicode-rs/unicode-normalization
>
> that might have been optimized.

I don't quite see how this compares to the Rust implementation. The
crate at that link uses a perfect hash, which I got rid of and gained
a 2x speedup.
As for ICU's C++ implementation, I have always considered it slow, at
least when called from C code.
I may well run benchmarks comparing the approach in Postgres with ICU,
but that is beyond the scope of the patches under discussion.

I want to emphasize that the patches I sent don't change the
normalization code/logic.
They change how the right codepoints are looked up in the tables,
which is what gives us the performance boost.
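
To illustrate the general idea of a hash-free lookup (this is only a
sketch, not the code from the patches): one common technique is a
two-stage table, where the high bits of a codepoint select a block and
the low bits select an entry inside it. The names toy_stage1,
toy_stage2 and toy_combining_class are hypothetical, and the tables
hold just four real canonical combining class values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 128 codepoints per block. */
#define BLOCK_BITS    7
#define BLOCK_SIZE    (1u << BLOCK_BITS)
#define MAX_CODEPOINT 0x10FFFFu

/*
 * Stage 1: maps (codepoint >> BLOCK_BITS) to a block number.
 * Every entry not listed is 0, i.e. the shared all-zero block.
 */
const uint8_t toy_stage1[(MAX_CODEPOINT + 1) >> BLOCK_BITS] = {
    [0x0300 >> BLOCK_BITS] = 1      /* U+0300..U+037F live in block 1 */
};

/* Stage 2: per-block data. Block 0 is all zeros (class 0, "not found"). */
const uint8_t toy_stage2[2][BLOCK_SIZE] = {
    [1] = {
        [0x00] = 230,               /* U+0300 COMBINING GRAVE ACCENT */
        [0x01] = 230,               /* U+0301 COMBINING ACUTE ACCENT */
        [0x23] = 220,               /* U+0323 COMBINING DOT BELOW */
        [0x27] = 202                /* U+0327 COMBINING CEDILLA */
    }
};

/* Two array reads and no hashing; unknown codepoints return 0. */
uint8_t
toy_combining_class(uint32_t cp)
{
    if (cp > MAX_CODEPOINT)
        return 0;
    return toy_stage2[toy_stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```

The trade-off in a layout like this is table size against lookup
cost: blocks with identical contents can share one stage 2 entry, so
sparse ranges cost almost nothing.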

> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
>
> * reducing the need for palloc and extra buffers, perhaps by using
> buffers on the stack for small strings
>
> * operate more directly on UTF-8 data rather than decoding and re-
> encoding the entire string

I absolutely agree with you: the normalization code is very well
written but far from optimized.
I didn't send changes to the normalization code itself, to avoid
lumping everything together and to make the review easier.
In keeping with my plan for optimizing normalization forms, I intend
to discuss optimizing the C code itself in the next iteration of
“Improve performance...”.
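
For reference, the stack-buffer suggestion quoted above could look
roughly like the sketch below. It is an assumption-laden toy, not
proposed code: scratch_demo and SCRATCH_ON_STACK are hypothetical
names, malloc/free stand in for palloc/pfree so that it compiles
outside the backend, and the work done on the buffer is a placeholder.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical threshold: inputs up to this length avoid the heap. */
#define SCRATCH_ON_STACK 64

/*
 * Widens each input byte into a uint32_t scratch buffer and counts the
 * non-ASCII bytes. Returns -1 if the heap allocation fails. *used_heap
 * reports which path was taken (only so the sketch can be tested).
 */
long
scratch_demo(const unsigned char *src, size_t len, bool *used_heap)
{
    uint32_t    stack_buf[SCRATCH_ON_STACK];
    uint32_t   *buf = stack_buf;
    long        non_ascii = 0;

    *used_heap = false;
    if (len > SCRATCH_ON_STACK)
    {
        /* Large input: fall back to a heap allocation. */
        buf = malloc(len * sizeof(uint32_t));
        if (buf == NULL)
            return -1;
        *used_heap = true;
    }

    /* Placeholder work standing in for decoding into the buffer. */
    for (size_t i = 0; i < len; i++)
        buf[i] = src[i];
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            non_ascii++;

    /* Free only what was actually allocated. */
    if (buf != stack_buf)
        free(buf);
    return non_ascii;
}
```

Short strings then never touch the allocator, and the fallback keeps
long inputs correct.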

--
Regards,
Alexander Borisov

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2025-06-19 17:41:57 from Jeff Davis

Responses

Re: Improve the performance of Unicode Normalization Forms. at 2025-06-20 17:20:08 from Jeff Davis
