Re: Improve the performance of Unicode Normalization Forms.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve the performance of Unicode Normalization Forms.
Date: 2025-06-24 15:20:39
Message-ID: 677cde50-6d64-474b-9ba8-bab380a111b3@gmail.com
Lists: pgsql-hackers

20.06.2025 20:20, Jeff Davis wrote:
> On Fri, 2025-06-20 at 17:51 +0300, Alexander Borisov wrote:
>> I don't quite see how this compares to the implementation on Rust. In
>> the link provided, they use perfect hash, which I get rid of and get
>> a x2 boost.
>> If you take ICU implementations in C++, I have always considered them
>> slow, at least when used in C code.
>> I may well run benchmarks and compare the performance of the approach
>> in Postgres and ICU. But this is beyond the scope of the patches
>> under discussion.
>
> Are you saying that, with these patches, Postgres will offer the
> fastest open-source Unicode normalization? If so, that would be very
> cool.

That's what we're aiming for: to implement the fastest approach.
With the two proposed patches applied, we get the fastest table-based
codepoint lookup. This is evidenced by the measurements made here and
earlier in the patch for the Unicode case improvement.

Once these patches are committed, I will work on the C normalization
code itself and optimize it. That's when we can run benchmarks and say
with confidence that we're the best at speed.

> The reason I'm asking is because, if there are multiple open source
> implementations, we should either have the best one, or just borrow
> another one as long as it has a suitable license (perhaps translating
> to C as necessary).

Before getting into this "fight" I studied different approaches to
looking up codepoints in tables (hash tables, binary search, radix
trees...) and came to the conclusion that the approach I proposed
(range index) is the fastest for this purpose.
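
To make the idea concrete, here is a toy sketch of a range-based lookup.
It is only an illustration of the general technique, not the generated
tables or code from the patches: the names (toy_range, toy_values,
toy_lookup) and the data are made up, and real tables are far larger.
Codepoints are grouped into sorted contiguous ranges, each range points
into one dense value array, and slot 0 is shared by every codepoint that
has no entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range descriptor; not taken from the patches. */
typedef struct
{
	uint32_t	first;		/* first codepoint in the range */
	uint32_t	last;		/* last codepoint in the range, inclusive */
	uint16_t	offset;		/* index in toy_values[] of the entry for 'first' */
} toy_range;

/* Dense value array; slot 0 is the shared "no entry" value. */
static const uint8_t toy_values[] = {
	0,						/* not found */
	1, 2, 3,				/* U+00C0..U+00C2 (made-up values) */
	4, 5					/* U+0400..U+0401 (made-up values) */
};

/* Ranges must be sorted by 'first' and must not overlap. */
static const toy_range toy_ranges[] = {
	{0x00C0, 0x00C2, 1},
	{0x0400, 0x0401, 4},
};

/*
 * Return the value for codepoint 'cp', or toy_values[0] if no range
 * covers it.  A few comparisons select the range, then one array
 * access fetches the value; no hashing is involved.
 */
static uint8_t
toy_lookup(uint32_t cp)
{
	size_t		nranges = sizeof(toy_ranges) / sizeof(toy_ranges[0]);

	for (size_t i = 0; i < nranges; i++)
	{
		if (cp < toy_ranges[i].first)
			break;			/* sorted ranges: nothing further can match */
		if (cp <= toy_ranges[i].last)
			return toy_values[toy_ranges[i].offset +
							  (cp - toy_ranges[i].first)];
	}
	return toy_values[0];
}
```

With only a handful of ranges, the comparison chain stays short and
predictable, and the final step is a plain indexed load. The loop here
stands in for whatever range selection a generator would emit.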

--
Regards,
Alexander Borisov
