| From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
|---|---|
| To: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
| Cc: | Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Improve the performance of Unicode Normalization Forms. |
| Date: | 2025-12-22 16:24:53 |
| Message-ID: | f584418b-c704-4308-b1cb-b667d99c0c6f@iki.fi |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 22/12/2025 17:44, Alexander Borisov wrote:
> The branching kept bothering me. The algorithm consisted of a function
> with branches (essentially a binary search with offsets) and a table of
> indexes. I really wanted to get rid of the branches. And then it dawned
> on me that instead of building branches with offsets, I could create
> another table that would contain these offsets.
> The "inspiration" came from my patch Optimization of the is_normalized()
> function [0].
>
> Actually, I changed the GenerateSparseArray algorithm, which now
> generates a table with offsets instead of a function with branches.
> In other words, we now have two tables:
> 1. A table with offsets, indicating what offset is needed to obtain the
> desired index from the index table.
> 2. A table with indexes. The data is divided into fixed ranges.
Sounds like you re-invented the radix tree :-). Nothing wrong with that;
a radix tree sounds like a good data structure for this.
> > The GenerateSparseArray adds comments like "/* U+1234 */" to each
> > element. That's nice but it implies that the elements are unicode code
> > points. GenerateSparseArray could be used for many other things. Let's
> > use "/* 0x1234 */" instead, or make it somehow configurable.
> >
> > The generated file is very large, over 1 MB. I guess it doesn't matter
> > all that much, but perhaps we should generate a little more compact
> > code. Maybe we don't need the "/* U+1234 */" comment on every line, but
> > only one comment for each range, for example.
>
> I have many places where I need to generate C tables. To make my life
> easier, and perhaps the lives of others, I added a small module for
> generating "pretty" tables. ./src/tools/PrettyLine.pm
> After applying it, the size of the generated header file was reduced
> to ~300KB.
I think a fixed number of values per line would look nicer than trying
to pack as much possible on each line.
(This isn't too important of course, this is just about making the
generated code look prettier.)
Thanks, I'll try to find time to take a closer look.
- Heikki
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Yugo Nagata | 2025-12-22 17:15:09 | psql: Fix tab completion for VACUUM option values |
| Previous Message | Dilip Kumar | 2025-12-22 15:41:03 | Re: Proposal: Conflict log history table for Logical Replication |