[PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

From: Diego Frias <mail(at)dzfrias(dot)dev>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
Date: 2026-06-01 18:38:32
Message-ID: B92ED640-7D4A-4505-B09F-3548F58CBB16@dzfrias.dev
Lists: pgsql-hackers

Hi hackers,

I was browsing PostgreSQL’s Unicode normalization code and found an issue where the composition algorithm recognizes 0x11A7 as a T syllable and combines it with a preceding LV syllable. Per the Unicode specification:

TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1.

In short, TCount counts one more than the number of T syllables, so that s % TCount == 0 implies that s has no T syllable (index 0 represents the absence of a T syllable), where s is the S-index of a precomposed Hangul character. Since PostgreSQL recognizes 0x11A7 as a T syllable, the composition algorithm yields a nonsense character when 0x11A7 is put in the T position. See https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/src/normalize.rs#L218 for a comment on this bug in Rust’s unicode-rs, and https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710 for a similar contribution I made to JuliaStrings/utf8proc a few months ago.

Let me know if this patch needs anything else. I can write a test for this, but it looks like the current testing setup in src/common/norm_test.c only runs the Unicode test suite and isn’t built for writing custom tests. If that is something of interest, though, I’m happy to add that to this patch.

Best,
Diego

Attachment Content-Type Size
v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch application/octet-stream 1.4 KB
