Quick Links

Re: Improve the performance of Unicode Normalization Forms.

From:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-12-22 17:55:28
Message-ID:	7b76973b-e9b6-4f80-a1d2-4bd467e38346@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

22.12.2025 19:24, Heikki Linnakangas wrote:
> On 22/12/2025 17:44, Alexander Borisov wrote:
>> The branching kept bothering me. The algorithm consisted of a function
>> with branches (essentially a binary search with offsets) and a table of
>> indexes. I really wanted to get rid of the branches. And then it dawned
>> on me that instead of building branches with offsets, I could create
>> another table that would contain these offsets.
>> The "inspiration" came from my patch Optimization of the is_normalized()
>> function [0].
>>
>> Actually, I changed the GenerateSparseArray algorithm, which now
>> generates a table with offsets instead of a function with branches.
>> In other words, we now have two tables:
>> 1. A table with offsets, indicating what offset is needed to obtain the
>> desired index from the index table.
>> 2. A table with indexes. The data is divided into fixed ranges.
>
> Sounds like you re-invented the radix tree :-). Nothing wrong with that;
> a radix tree sounds like a good data structure for this.

I looked at the implementation of radix trees for Encoding in
PostgreSQL. There are similarities, of course, but in my implementation,
the access speed is literally O(1).

For example:
return table_index[ table_offset[cp >> 6] + (cp & 63) ];

Access speed in Encoding (radix tree), it is more complicated. If I
understand correctly.

In general, there was an idea to store offset and index data in one
table. Not to separate them.
But I'm sure you'll understand everything when you see the code.

>> > The GenerateSparseArray adds comments like "/* U+1234 */" to each
>> > element. That's nice but it implies that the elements are unicode code
>> > points. GenerateSparseArray could be used for many other things. Let's
>> > use "/* 0x1234 */" instead, or make it somehow configurable.
>> >
>> > The generated file is very large, over 1 MB. I guess it doesn't matter
>> > all that much, but perhaps we should generate a little more compact
>> > code. Maybe we don't need the "/* U+1234 */" comment on every line,
>> but
>> > only one comment for each range, for example.
>>
>> I have many places where I need to generate C tables. To make my life
>> easier, and perhaps the lives of others, I added a small module for
>> generating "pretty" tables. ./src/tools/PrettyLine.pm
>> After applying it, the size of the generated header file was reduced
>> to ~300KB.
>
> I think a fixed number of values per line would look nicer than trying
> to pack as much possible on each line.
>
> (This isn't too important of course, this is just about making the
> generated code look prettier.)

With a sense of inner beauty :), when the number of values in a line is
fixed, the lines start to "dance", which doesn't look very good.
On the other hand, we can convert all values to %06X or something like
that. But then it would defeat the whole purpose of the idea — to reduce
the file size and make it look nice.
I use this approach to create many tables, which allows me not to think
about how the numbers will fit in and not waste time on it.
But, it's more a matter of taste.

> Thanks, I'll try to find time to take a closer look.

Thank you for your time.

--
Regards,
Alexander Borisov

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2025-12-22 16:24:53 from Heikki Linnakangas

Responses

Re: Improve the performance of Unicode Normalization Forms. at 2026-01-12 13:24:37 from Alexander Borisov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Melanie Plageman	2025-12-22 17:57:16	Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Previous Message	Laurenz Albe	2025-12-22 17:44:43	Re: Get rid of "Section.N.N.N" on DOCs