Quick Links

Re: Improve the performance of Unicode Normalization Forms.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-08-13 22:12:17
Message-ID:	f3fee95b221846b7ab6b08ddfad039518c6aa97b.camel@j-davis.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:
> As a result, I would not look into ICU at the moment, especially
> since
> we have a different approach.
> I am currently working on optimizing unicode_normalize().
> I am trying to come up with an improved version of the algorithm in C
> by the next commitfest (which will be in September).

Agreed, but thank you for adding context so I can understand where we
are.

The patch as proposed is a speedup, and also a simplification because
it eliminates the different code path for the frontend code. That also
makes me feel better about testing, because I don't think both those
paths were tested equally.

Comments on the patch itself:

The 0001 patch generalizes the two-step lookup process: first navigate
branches to find the index into a partially-compacted sparse array, and
then use that to get the index into the dense array. The branching
code, the sparse array, and the dense array are all generated code. The
reason for the two-step lookup is to keep the sparse array element size
small (uint16), so that missing elements consume less space (missing
elements still remain for small gaps). The full entry is kept in the
dense array.

GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
the new module. And we should probably include "sparse" in the name of
the sparse arrays.

The new module is responsible for generating the branching code as well
as the sparse array; while the caller is reponsible for the dense
arrays. For case mapping, one sparse array is used for four parallel
arrays for the different case kinds (lower/title/upper/fold).

The use of zero values for different purposes is getting confusing. It
means "doesn't exist", but we are also reserving the zeroth element in
the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
have the caller check for it? That way we don't need to reserve the
zeroth array element, which should make it easier to avoid off-by-one
errors.

I think we can simplify the interface, as well. Why does the caller
need to separately generate the ranges, then generate the table, then
generate the branches? It's really all the same action and can be based
on an input hash with a certain structure, and then return both the
table and the branches, right?

Regards,
Jeff Davis

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2025-08-11 14:21:04 from Alexander Borisov

Responses

Re: Improve the performance of Unicode Normalization Forms. at 2025-08-14 14:41:42 from Alexander Borisov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2025-08-13 22:23:49	Re: index prefetching
Previous Message	Cary Huang	2025-08-13 22:03:06	Re: Support tid range scan in parallel?