| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | lex(dot)borisov(at)gmail(dot)com |
| Cc: | michael(at)paquier(dot)xyz, hlinnaka(at)iki(dot)fi, pgsql(at)j-davis(dot)com, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Improve the performance of Unicode Normalization Forms. |
| Date: | 2026-07-01 06:29:25 |
| Message-ID: | CAAAe_zDpT1FgfXa4L8hVrkcyugrQ=5PtVT_r8DySCn0akfXEKw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Alexander,
Rebased.
>
> v12:
> * Fixed bug in unicode_normalize() logic around blocked recomposition:
> Correctly updates the starter when a new ccc == 0 character appears.
> Avoids resetting prev_ccc incorrectly after successful recomposition.
> Added an NFC regression test for: x + acute + a + acute -> x + acute + á
> * Adds overflow checks for uint8/uint16 table indexes and sizes.
> * Adds decomposition_sort_length() helper.
> * Makes generated tables match what pgindent expects.
>
While reviewing v12 I found a memory-safety issue: the two-stage lookup
tables are read out of bounds during normalization.
Cause: TwoStageTable::generate() fills Index slots only up to the last
mapped key, leaving the final block short, but the generated lookup
always indexes the full block (index_name[offset + (cp & range_mask)]).
The offset table is pre-allocated to full size; @data is not padded to
$pos. Result:
* decomp_table_index[17438]: U+2FA1E..U+2FA3F -> 17408 + (cp & 63) =
17438..17471 (up to 34 slots past the end)
* inverse_table_index[3754]: U+16D6A..U+16D7F -> 3712 + (cp & 63) =
3754..3775 (up to 22 slots past the end)
Two things kept it hidden: no existing test touches this block (so cfbot
stays green), and even a test that does won't fail without a bounds
check -- the over-read silently returns 0 and happens to give the right
answer.
Reproduction: the attached test-only patch (nocfbot-unicode-norm-oob.txt)
adds bounds assertions to the two lookups plus regression tests that hit
the boundary. On an --enable-cassert build,
SELECT normalize(U&'\00C5\+02FA1E', NFD);
SELECT normalize(U&'\+016D6A\0301', NFC);
abort the backend (TRAP: failed Assert("offset + (cp & 63) <
lengthof(...)")).
Fix: pad the final block after the key loop in generate():
$data[$pos - 1] //= $tst->{dummy};
This extends the tables to 17472/3776, leaves the reader unchanged, and
resolves those codepoints to the dummy 0 entry (NULL), which is correct.
The generator change and the regenerated unicode_norm_table.h go
together. I'd keep the bounds assertion in the lookup as a guard against
future under-padding.
Thanks,
Henson
| Attachment | Content-Type | Size |
|---|---|---|
| nocfbot-unicode-norm-oob.txt | text/plain | 3.9 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Richard Guo | 2026-07-01 06:34:18 | Re: Improve UNION's output rowcount estimate |
| Previous Message | Kirill Reshke | 2026-07-01 06:26:13 | Re: DROP INVALID INDEXES command |