Re: Improve the performance of Unicode Normalization Forms.

From: Henson Choi <assam258(at)gmail(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve the performance of Unicode Normalization Forms.
Date: 2026-05-05 08:02:22
Message-ID: CAAAe_zCPd0i58viVqikOmLsTBMhP46E1FjSKFka6Ms3zsSDTxQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Alexander,

Thank you for the patch. I applied v10 to a clean checkout, built it,
and ran the regression suite — all tests pass. Nice work.

I have attached a review report below.

Detailed gcov coverage in HTML format is included in coverage.tgz.

Unicode Normalization Performance Patch Review
===============================================

Patch: Improve the performance of Unicode Normalization Forms
Commitfest: https://commitfest.postgresql.org/patch/5802

Review Methodology:
Quality assessment focused on key code paths and coverage analysis.
Code coverage measured using gcov (modified lines only), supplemented
by normalization-check (NormalizationTest.txt).

Limitations:
- Not a security audit or formal verification

TABLE OF CONTENTS
-----------------

1. Executive Summary
2. Issues Found
2.1 Critical / Major
2.2 Minor (Review Needed)
3. Test Coverage
3.1 Covered Areas
3.2 Untested Items
4. Code Coverage Analysis
4.1 Overall Coverage
4.2 Uncovered Code Risk Assessment
5. Commit Analysis
6. Recommendations
7. Conclusion

1. EXECUTIVE SUMMARY
--------------------

Overall assessment: GOOD

The patch replaces two separate lookup mechanisms (bsearch for FRONTEND,
perfect hash for BACKEND) with a single two-level sparse array. This
unifies code paths that diverged in 2020 for pragmatic reasons, and
shrinks the generated table from ~19K to ~9K lines. The approach is
well-established in Unicode processing libraries (CPython, ICU, Tcl).
No critical issues found.

Performance was measured both by the author and independently by Jeff Davis
using ICU v76.1 test data (100 iterations, ICU collation test strings):

Without patch: NFD 9.1s NFC 14.5s
With patch: NFD 1.6s NFC 3.0s
ICU v76.1: NFD 1.9s NFC 1.1s

Thread: https://postgr.es/m/7859e5ef-a574-4199-a69b-6fee26711521@gmail.com

Rating by Area:
- Code Quality: Good (clean unification, no style violations)
- Test Coverage: Adequate (95.4% modified-line coverage)
- Documentation: N/A (no user-facing documentation changes)
- Build/Regress: Pass

2. ISSUES FOUND
---------------

2.1 Critical / Major

None.

2.2 Minor (Review Needed)

#1 [Design] GenerateSparseArray.pm - zero value dual semantics

Index value 0 serves as both "entry not found" and a valid array
index. This works today because all callers reserve position 0 as a
null entry, but the convention is implicit and unenforced. A future
caller that does not follow it would silently return wrong data.
Jeff Davis suggested #define EMPTY 0xFFFF to make the sentinel
explicit. The author's response was non-committal. This remains open.

#2 [Code] unicode_norm.c:476-480 - OOM path untestable

ALLOC() failure in unicode_normalize() returns NULL, but this branch
is unreachable in BACKEND (palloc() throws on OOM) and untestable
without fault injection.

Recommend manual review of all FRONTEND callers to confirm NULL return
is handled correctly.

3. TEST COVERAGE
----------------

3.1 Covered Areas

Existing tests in src/test/regress/sql/unicode.sql exercise all four
normalization forms (NFC, NFD, NFKC, NFKD), idempotency, empty string,
IS NORMALIZED predicate, and error cases.

3.2 Untested Items

The following lines were uncovered before this review. Sample tests
have been added to unicode.sql and verified via pg-regress.

Line Code path
-------------------------------------------------------------------------------
unicode_norm.c:337 canonical_reorder() in Hangul decomposition path
unicode_norm.c:414 canonical_reorder() in non-Hangul decomposition path
unicode_norm.c:476-480 ALLOC() failure — FRONTEND OOM only, untestable
unicode_norm.c:550 FREE(classes) when input decomposes to >512 chars

Tests added for lines 337, 414, 550:

-- Multiple combining marks before a Hangul syllable trigger canonical
-- reorder before the Hangul fast path writes its jamo decomposition.
SELECT normalize(U&'\0308\0301\AC00', NFD) = U&'\0308\0301\1100\1161'
COLLATE "C" AS test_hangul_reorder;

-- Multiple combining marks before a precomposed character trigger
-- canonical reorder when the starter from its decomposition is written.
SELECT normalize(U&'\0308\0301\00E0', NFD) = U&'\0308\0301\0061\0300'
COLLATE "C" AS test_decomp_reorder;

-- class_buf is uint8[512]; repeat(U+00E4, 300) decomposes to 600 chars,
-- forcing heap allocation for the combining class buffer.
SELECT length(normalize(repeat(U&'\00E4', 300), NFD)) = 600 AS
test_large_decomp;

4. CODE COVERAGE ANALYSIS
-------------------------

4.1 Overall Coverage (modified lines only)

95.4% (146 / 153 lines) — measured on unicode_norm.c.
GenerateSparseArray.pm, PrettyLine.pm, generate-unicode_norm_table.pl,
and unicode_norm_table.h are build-time artifacts; runtime coverage not
applicable.

4.2 Uncovered Code Risk Assessment

The 7 uncovered lines fall into two groups:

Lines 337, 414, 550 — reachable but required edge-case inputs.
Tests have been added (see section 3.2).

Lines 476-480 — OOM path, unreachable in BACKEND. The sole FRONTEND
production caller (saslprep.c) handles NULL correctly.

No uncovered code presents a functional or security risk.

5. COMMIT ANALYSIS
------------------

4 commits (2 preparatory + 2 core):

Commit Area Files +/- Key Content
-------------------------------------------------------------------------------
1 Tools 1 +132/0 PrettyLine.pm - line formatter
2 Tools 1 +374/0 GenerateSparseArray.pm - sparse
array generator
3 Core 7 +6485/-12534 Sparse array tables + lookup,
remove perfect hash
4 Refactor+Perf 3 +3746/-3223 unicode_normalize() rewrite,
compat decomp separation

Net change: -5,020 lines. unicode_norm_hashfunc.h deleted (-3,025 lines);
unicode_norm_table.h compacted from ~19K to ~9K lines.

6. RECOMMENDATIONS
------------------

1. Add tests for lines 337, 414, and 550 to the regression suite.
Sample SQL provided in section 3.2 — all three verified correct.

2. Revisit zero value semantics in GenerateSparseArray.pm (see section 2.2
#1).
The current code is correct, but the "0 means not found" convention is
implicit. Consider #define EMPTY 0xFFFF before adding more callers.

3. Review FRONTEND callers of unicode_normalize() to confirm NULL return
is handled correctly (see section 2.2 #2).

7. CONCLUSION
-------------

The patch is a well-scoped improvement that closes a gap left open in
2020: the FRONTEND path was kept on bsearch to avoid inflating libpq,
while BACKEND used a perfect hash. The sparse array is smaller than the
hash table yet faster than bsearch, making it viable for both contexts.

The only item worth discussing before commit is whether to include the
three new regression tests from section 3.2. Everything else is clean.

Attachment Content-Type Size
coverage.tgz application/x-gzip 14.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2026-05-05 08:08:45 Re: Row pattern recognition
Previous Message Dmitrii Bondar 2026-05-05 07:48:30 Re: Pgbench: remove synchronous prepare