Quick Links

Re: Improve the performance of Unicode Normalization Forms.

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Henson Choi <assam258(at)gmail(dot)com>
Cc:	lex(dot)borisov(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql(at)j-davis(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2026-07-03 07:46:09
Message-ID:	akdowYMo4GCvjQub@paquier.xyz
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Jul 01, 2026 at 03:29:25PM +0900, Henson Choi wrote:
> Two things kept it hidden: no existing test touches this block (so cfbot
> stays green), and even a test that does won't fail without a bounds
> check -- the over-read silently returns 0 and happens to give the right
> answer.
>
> Reproduction: the attached test-only patch (nocfbot-unicode-norm-oob.txt)
> adds bounds assertions to the two lookups plus regression tests that hit
> the boundary. On an --enable-cassert build,
>
> SELECT normalize(U&'\00C5\+02FA1E', NFD);
> SELECT normalize(U&'\+016D6A\0301', NFC);
>
> abort the backend (TRAP: failed Assert("offset + (cp & 63) <
> lengthof(...)")).

That's a good find (if you've used some AI tooling here, good use). I
don't see an immediate need for in-core checks. One can also enable
ubsan to get the same effect with out-of-bound lookups reusing your
queries:
../../src/include/common/unicode_norm_table.h:6524:27: runtime error:
index 17438 out of bounds for type 'uint16 [17438]'

The tests would be more important. If we have a gap in these, this
usually points to a first patch where we add more tests. Please note
that you may not actually need more tests. If I run the test module
test_saslprep with PG_TEST_EXTRA=saslprep to make the costly TAP test
run, I get two ubsan reports of this kind. Note that I enable it in
some of my buildfarm machines, so it would not go unnoticed.

> This extends the tables to 17472/3776, leaves the reader unchanged, and
> resolves those codepoints to the dummy 0 entry (NULL), which is correct.
> The generator change and the regenerated unicode_norm_table.h go
> together. I'd keep the bounds assertion in the lookup as a guard against
> future under-padding.

This kind of issue is bad. Normalization is applied on the passwords
for SCRAM with SASLprep, meaning this patch would touch code that's
taken before authentication. This would lead to an immediate CVE, and
most likely one with a bad score. We need to be extremely careful
with what we are dealing here. SCRAM is used as default
authentication method for a lot of cloud providers, AFAIK.

Saying that (and still trying to figure out the whole code), the
numbers you have published in terms of reduction of the size of the
binaries and the runtimes, based on Jeff's or your methods are
impressive, seeing what has been posted. I would not underestimate
the cost of the two-pass method we currently use when applying the
normalization (one to calculate the output size, one to do the
normalization). Hence why don't we just do that first and evaluate
the gain in run time that we get? One thing that I have been
wondering why reading this thread is simply: why don't we use a
BinaryStringInfo and start the normalization with a sensible
preallocated size, avoiding the first pass for the sizing of the
output buffer? It should be a simpler change that rewriting all the
tables, and I suspect that it may improve the performance quite a bit,
the longer the strings to normalize. And I also suspect that we
should do that anyway at the end, and do it first.

I am aware of the fact that the patch does some allocation changes,
but I would suggest to split the patch into more reviewable pieces
rather than drop a big chunk of code changes presented as a single
commit, and present one patch for each relevant piece that aims at
improving performance. Something that shows up as a red flag for me
is the fact that the patch has a long list of bullet points. Each
bullet point, in some cases, can point to independent patches worth
doing on their own. Anyway, considering that the simplest pieces may
be better first, I could think of at least:
1) Elimination of the second allocation for the recomposition
(in-place recomposition for the decomps buffers).
2) Merge the two-pass decomposition into a single pass
3) The suggestion of Jeff to use a UTF8-native path is also something
I'd wonder about, rather than do the uint32 codepoint dance. Not sure
about the complexity, but Jeff has mentioned me in the past for his
work on native UTF-8 collation support that relying on the codepoints
in core was much, much, much faster than ICU (that was for sorting,
but we'd burn less CPU for sure here as well).
4) What is now the largest chunk, with the table recreations. I'd
suggest some more underground work here in terms of tests, identifying
potential gaps in our existing testing that's cheap and good enough
for meaningful corner cases. I am wondering if we don't have more to
extract here. One idea: we could store the combining class during
decomposition, saving one extra lookup?

The NF[K]{C,D} operations have been done as they are as a matter of
simplicity, relying on the introduction of the normalization API done
for SCRAM in v10, where performance was not (and is still not)
critical.

As a whole, this patch is too large as-is to be digested in a review.
Due to the pre-authentication requirement, something not rock-solid is
as good as nothing (one can say that for all features, but this one is
particularly exposed).
(Spoiler: I rely on my brain for this kind of work, gentler for the
planet with less tokens and memory burnt. A patch split as a bunch of
pieces is more edible for me, personally.)
--
Michael

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2026-07-01 06:29:25 from Henson Choi

Browse pgsql-hackers by date

	From	Date	Subject
Previous Message	Jim Jones	2026-07-03 07:34:22	Re: Truncate logs by max_log_size