Re: Improve the performance of Unicode Normalization Forms.

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve the performance of Unicode Normalization Forms.
Date: 2025-12-12 12:22:38
Message-ID: ec5c5854-226b-4105-88e1-efa7dc05fba4@iki.fi
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24/11/2025 11:55, Alexander Borisov wrote:
> Hey guys, any news?

I finally got around to look at this, thanks for your patience!

> v6-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch

+1, this makes the existing code more readable, even without the rest of
the patches. Thanks for the added comments. I'll review this in more
detail, but it seems pretty close to be ready for committing.

Thinking of GenerateSparseArray's API, I think what the caller really
wants is to generate a function like:

/*
* Look up the value of 'x'.
*/
static uint16
lookup(uint16)
{
...
}

That's how the PerfectHash module works. The tables should be
implementation detail that the caller doesn't need to know about.

> v6-0002-Improve-the-performance-of-Unicode-Normalization-.patch
> v6-0003-Refactoring-Unicode-Normalization-Forms-performan.patch

These basically look OK to me too. Some minor issues and ideas for
further work:

The 'make update-unicode' and 'ninja update-unicode' targets are broken.
Need to be updated for the removal of 'unicode_norm_hashfunc.h'.

The GenerateSparseArray adds comments like "/* U+1234 */" to each
element. That's nice but it implies that the elements are unicode code
points. GenerateSparseArray could be used for many other things. Let's
use "/* 0x1234 */" instead, or make it somehow configurable.

The generated file is very large, over 1 MB. I guess it doesn't matter
all that much, but perhaps we should generate a little more compact
code. Maybe we don't need the "/* U+1234 */" comment on every line, but
only one comment for each range, for example.

> typedef struct
> {
> uint8 comb_class; /* combining class of character */
> uint8 dec_size_flags; /* size and flags of decomposition code list */
> uint16 dec_index; /* index into UnicodeDecomp_codepoints, or the
> * decomposition itself if DECOMP_INLINE */
> } pg_unicode_decomposition;

The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't
exist anymore.

Finally, some ideas for packing the arrays more tightly. I'm not sure if
these make any difference in practice or are worth the effort, but here
we go:

There are only 56 distinct comb_classes, and 18 distinct dec_size_flags,
so if we added one more small lookup table for them, they could be
packed into a single byte. That would shrink pg_unicode_decomposition
from 4 to 3 bytes.

> static const uint8 UnicodeDecompSizes[4931] =

The max value stored here is 7, so you could get away with just 3 bits
per element.

> static const uint16 decomp_map[33752] =

This array consists of multiple ranges, and each range is accessed in a
separate piece of code. We could use multiple arrays, one for each
range, instead of one big array. Some of the ranges only store a small
range of values, so we could use uint8 for them.

- Heikki

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jakub Wartak 2025-12-12 12:54:38 Re: [PATCH] Better Performance for PostgreSQL with large INSERTs
Previous Message Zsolt Parragi 2025-12-12 11:49:23 Re: Periodic authorization expiration checks using GoAway message