Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

From: Diego Frias <mail(at)dzfrias(dot)dev>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
Date: 2026-06-04 16:32:53
Message-ID: D6D525DE-C1F2-498D-829C-396240337B59@dzfrias.dev
Lists: pgsql-hackers

Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.

Thanks,
Diego

> On Jun 3, 2026, at 9:07 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
>
> Oops. Yes, including TBASE in the recomposition is incorrect. I found
> the passage you quoted here ("TBase is set to one less.."):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
>
> The character gets eaten by the normalization. Not good.
>
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
>
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here. For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
>
> How about adding more normalization check patterns while at it? I am
> finishing with the attached, all things combined. Diego, what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
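
For readers of the archive, the off-by-one discussed above can be sketched in C as follows. This is a minimal illustration, not the code from the attached patch: the constants follow the Hangul composition section of the Unicode core specification, while the helper names is_lv_syllable() and compose_lv_t() are hypothetical and chosen only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants from the Unicode core spec (Hangul syllable composition). */
#define SBASE  0xAC00	/* first precomposed Hangul syllable */
#define TBASE  0x11A7	/* one less than the first trailing consonant, U+11A8 */
#define TCOUNT 28		/* 27 trailing consonants plus the "no T" slot */
#define SCOUNT 11172	/* number of precomposed Hangul syllables */

/* Hypothetical helper: true if ch is a precomposed LV syllable (no T part). */
static bool
is_lv_syllable(uint32_t ch)
{
	return ch >= SBASE && ch < SBASE + SCOUNT &&
		(ch - SBASE) % TCOUNT == 0;
}

/*
 * Hypothetical helper: compose an LV syllable with a trailing consonant.
 * Returns false, leaving *result untouched, if the pair does not compose.
 */
static bool
compose_lv_t(uint32_t lv, uint32_t t, uint32_t *result)
{
	/* Strict lower bound: TBASE itself (U+11A7) is not a T jamo. */
	if (!is_lv_syllable(lv) || t <= TBASE || t >= TBASE + TCOUNT)
		return false;

	*result = lv + (t - TBASE);
	return true;
}
```

With an inclusive lower bound (t >= TBASE), the pair U+AC00 U+11A7 would compose to 0xAC00 + 0 = U+AC00, so the second code point would silently disappear. That matches the "eaten" behavior described above and is what the suggested regression query checks for.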
