| From: | Diego Frias <mail(at)dzfrias(dot)dev> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization |
| Date: | 2026-06-04 16:32:53 |
| Message-ID: | D6D525DE-C1F2-498D-829C-396240337B59@dzfrias.dev |
| Lists: | pgsql-hackers |
Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.
Thanks,
Diego
> On Jun 3, 2026, at 9:07 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
>
> Oops. Yes, including TBASE in the recomposition is incorrect. I found
> your quote here (TBase is set to one less...):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
>
> The character gets eaten by the normalization. Not good.
>
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
>
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here. For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
>
> How about adding more normalization check patterns, while on it? I am
> finishing with the attached, all things combined. Diego, what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
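For readers following the thread, here is a minimal sketch of the off-by-one discussed above. It is not the PostgreSQL source or the attached patch. The helper name `compose_lv_t` is hypothetical, and the sketch assumes the fix amounts to making the lower bound of the T range strict. The constants are the ones given in chapter 3 of the Unicode core specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants from the Unicode core specification, chapter 3. */
#define SBASE  0xAC00u   /* first precomposed Hangul syllable */
#define TBASE  0x11A7u   /* one less than the first trailing consonant, U+11A8 */
#define TCOUNT 28u       /* 27 trailing consonants plus one slot meaning "none" */
#define SCOUNT 11172u    /* 19 * 21 * 28 precomposed syllables */

/*
 * Hypothetical helper (not a PostgreSQL function): try to combine an LV
 * syllable with a trailing consonant. On success, stores the LVT syllable
 * in *result and returns true; otherwise returns false and leaves *result
 * untouched.
 */
static bool
compose_lv_t(uint32_t lv, uint32_t t, uint32_t *result)
{
    /* lv must be a precomposed syllable whose T slot is empty */
    if (lv < SBASE || lv >= SBASE + SCOUNT || (lv - SBASE) % TCOUNT != 0)
        return false;

    /*
     * Strict lower bound: t == TBASE would give a T index of 0, which
     * means "no trailing consonant", so U+11A7 must not be accepted.
     */
    if (t <= TBASE || t >= TBASE + TCOUNT)
        return false;

    *result = lv + (t - TBASE);
    return true;
}
```

With a non-strict bound (`t >= TBASE`), the pair U+AC00 U+11A7 would compose to U+AC00 + 0, that is U+AC00 alone, which matches the "character gets eaten" behavior described in the quoted message. With the strict bound the pair is left as is, which is what the regression query above expects.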