| From: | Diego Frias <mail(at)dzfrias(dot)dev> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization |
| Date: | 2026-06-01 18:38:32 |
| Message-ID: | B92ED640-7D4A-4505-B09F-3548F58CBB16@dzfrias.dev |
| Lists: | pgsql-hackers |
Hi hackers,
I was browsing PostgreSQL’s Unicode normalization code and found an issue where the composition algorithm recognizes 0x11A7 as a T syllable and combines it with subsequent S and V syllables. Per the Unicode specification:
TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1.
In short, TCount is one more than the number of T syllables. This is so that s % TCount == 0 implies that s has no T syllable (the 0th place represents the absence of a T syllable), where s is the s-index of a precomposed Hangul character. Anyway, since PostgreSQL recognizes 0x11A7 as a T syllable, the composition algorithm yields a nonsense character when 0x11A7 is put in the T position. See https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/src/normalize.rs#L218 for a comment on this bug in Rust’s unicode-rs, and https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710 for a similar contribution I made to JuliaStrings/utf8proc a few months ago.
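To illustrate the range check described above, here is a minimal, self-contained sketch of Hangul pair composition. It is not the attached patch and not PostgreSQL's actual code: the function name compose_hangul and the macro names are assumptions made for this example, with constant values taken from the Unicode Standard's Hangul syllable composition algorithm. The point is that the T check starts at TBASE + 1, so 0x11A7 is never treated as a trailing consonant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants as given in the Unicode Standard (names are illustrative). */
#define SBASE  0xAC00u
#define LBASE  0x1100u
#define VBASE  0x1161u
#define TBASE  0x11A7u          /* one less than the first T, U+11A8 */
#define LCOUNT 19u
#define VCOUNT 21u
#define TCOUNT 28u              /* (0x11C2 - 0x11A8 + 1) + 1: 27 Ts plus a "no T" slot */
#define NCOUNT (VCOUNT * TCOUNT)
#define SCOUNT (LCOUNT * NCOUNT)

/*
 * Hypothetical helper: try to compose 'start' followed by 'code'.
 * Returns true and stores the composed code point in *result on success.
 */
static bool
compose_hangul(uint32_t start, uint32_t code, uint32_t *result)
{
    /* L + V -> LV */
    if (start >= LBASE && start < LBASE + LCOUNT &&
        code >= VBASE && code < VBASE + VCOUNT)
    {
        uint32_t lindex = start - LBASE;
        uint32_t vindex = code - VBASE;

        *result = SBASE + (lindex * VCOUNT + vindex) * TCOUNT;
        return true;
    }

    /*
     * LV + T -> LVT.  The lower bound is exclusive: TBASE itself (0x11A7)
     * stands for "no T" and must not be accepted as a trailing consonant.
     */
    if (start >= SBASE && start < SBASE + SCOUNT &&
        (start - SBASE) % TCOUNT == 0 &&
        code > TBASE && code < TBASE + TCOUNT)
    {
        *result = start + (code - TBASE);
        return true;
    }

    return false;
}
```

With this check, compose_hangul(0xAC00, 0x11A8, &r) succeeds and sets r to 0xAC01, while compose_hangul(0xAC00, 0x11A7, &r) returns false and leaves the pair uncomposed. Writing the lower bound as code >= TBASE instead would accept 0x11A7 with a T index of 0.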
Let me know if this patch needs anything else. I can write a test for this, but it looks like the current testing setup in src/common/norm_test.c only runs the Unicode test suite and isn’t built for writing custom tests. If that is something of interest, though, I’m happy to add that to this patch.
Best,
Diego
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch | application/octet-stream | 1.4 KB |