Quick Links

Re: Improve the performance of Unicode Normalization Forms.

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-12-22 16:24:53
Message-ID:	f584418b-c704-4308-b1cb-b667d99c0c6f@iki.fi
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 22/12/2025 17:44, Alexander Borisov wrote:
> The branching kept bothering me. The algorithm consisted of a function
> with branches (essentially a binary search with offsets) and a table of
> indexes. I really wanted to get rid of the branches. And then it dawned
> on me that instead of building branches with offsets, I could create
> another table that would contain these offsets.
> The "inspiration" came from my patch Optimization of the is_normalized()
> function [0].
>
> Actually, I changed the GenerateSparseArray algorithm, which now
> generates a table with offsets instead of a function with branches.
> In other words, we now have two tables:
> 1. A table with offsets, indicating what offset is needed to obtain the
> desired index from the index table.
> 2. A table with indexes. The data is divided into fixed ranges.

Sounds like you re-invented the radix tree :-). Nothing wrong with that;
a radix tree sounds like a good data structure for this.

> > The GenerateSparseArray adds comments like "/* U+1234 */" to each
> > element. That's nice but it implies that the elements are unicode code
> > points. GenerateSparseArray could be used for many other things. Let's
> > use "/* 0x1234 */" instead, or make it somehow configurable.
> >
> > The generated file is very large, over 1 MB. I guess it doesn't matter
> > all that much, but perhaps we should generate a little more compact
> > code. Maybe we don't need the "/* U+1234 */" comment on every line, but
> > only one comment for each range, for example.
>
> I have many places where I need to generate C tables. To make my life
> easier, and perhaps the lives of others, I added a small module for
> generating "pretty" tables. ./src/tools/PrettyLine.pm
> After applying it, the size of the generated header file was reduced
> to ~300KB.

I think a fixed number of values per line would look nicer than trying
to pack as much possible on each line.

(This isn't too important of course, this is just about making the
generated code look prettier.)

Thanks, I'll try to find time to take a closer look.

- Heikki

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2025-12-12 12:22:38 from Heikki Linnakangas

Responses

Re: Improve the performance of Unicode Normalization Forms. at 2025-12-22 17:55:28 from Alexander Borisov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Yugo Nagata	2025-12-22 17:15:09	psql: Fix tab completion for VACUUM option values
Previous Message	Dilip Kumar	2025-12-22 15:41:03	Re: Proposal: Conflict log history table for Logical Replication