Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Diego Frias <mail(at)dzfrias(dot)dev>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
Date: 2026-06-04 04:07:00
Message-ID: aiD55PbCxUKuScRr@paquier.xyz
Lists: pgsql-hackers

On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
> In short, TCount actually counts 1 more than the number of T
> syllables; this is so s % TCount == 0 implies that s has no T
> syllable (because the 0th place represents the absence of a T
> syllable), where s is the s-index of a precomposed Hangul
> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
> syllable, the composition algorithm yields a nonsense character when
> 0x11A7 is put in the T position.

Oops. Yes, including TBASE in the recomposition is incorrect. I found
the passage you are quoting here ("TBase is set to one less.."):
https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688

The character gets eaten by the normalization. Not good.
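
To illustrate the off-by-one, here is a minimal sketch in C of the LV + T
recomposition check. This is not PostgreSQL's actual code: the helper
names (is_lv_syllable, compose_lv_t, compose_lv_t_inclusive,
hangul_sketch_selfcheck) are hypothetical, and only the constants come
from the Unicode specification. The strict lower bound on T is the
point of interest; the inclusive variant shows how U+11A7 gets absorbed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants as defined by the Unicode core specification. */
#define SBASE  0xAC00u
#define TBASE  0x11A7u  /* one less than the first trailing consonant, U+11A8 */
#define TCOUNT 28u      /* 27 trailing consonants plus one slot for "no T" */
#define SCOUNT 11172u   /* 19 * 21 * 28 precomposed syllables */

/* Hypothetical helper: true if cp is a precomposed syllable with no T part. */
static bool
is_lv_syllable(uint32_t cp)
{
    return cp >= SBASE && cp < SBASE + SCOUNT && (cp - SBASE) % TCOUNT == 0;
}

/*
 * Hypothetical helper: compose an LV syllable with a trailing consonant.
 * The lower bound is strict, so U+11A7 (T index 0) never composes.
 */
static bool
compose_lv_t(uint32_t lv, uint32_t t, uint32_t *result)
{
    if (!is_lv_syllable(lv))
        return false;
    if (t <= TBASE || t >= TBASE + TCOUNT)
        return false;
    *result = lv + (t - TBASE);
    return true;
}

/*
 * Same check with an inclusive lower bound, i.e. the off-by-one under
 * discussion. U+11A7 is accepted with T index 0, so the result is the
 * LV syllable alone and the second code point disappears.
 */
static bool
compose_lv_t_inclusive(uint32_t lv, uint32_t t, uint32_t *result)
{
    if (!is_lv_syllable(lv))
        return false;
    if (t < TBASE || t >= TBASE + TCOUNT)
        return false;
    *result = lv + (t - TBASE);
    return true;
}

/* Small self-check exercising both variants. */
static void
hangul_sketch_selfcheck(void)
{
    uint32_t out = 0;

    /* U+AC00 + U+11A8 composes to U+AC01. */
    assert(compose_lv_t(0xAC00u, 0x11A8u, &out) && out == 0xAC01u);

    /* U+11A7 is not a trailing consonant: no composition. */
    assert(!compose_lv_t(0xAC00u, 0x11A7u, &out));

    /* With the inclusive bound, U+11A7 is silently absorbed. */
    assert(compose_lv_t_inclusive(0xAC00u, 0x11A7u, &out) && out == 0xAC00u);
}
```

With the strict bound, the pair U+AC00 U+11A7 is left as two code
points instead of collapsing into U+AC00.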

> Let me know if this patch needs anything else. I can write a test
> for this, but it looks like the current testing setup in
> src/common/norm_test.c only runs the Unicode test suite and isn’t
> built for writing custom tests. If that is something of interest,
> though, I’m happy to add that to this patch.

We have a set of tests in src/test/regress/sql/unicode.sql that would
fit nicely with what you want to address here. For this specific
problem, this would work:
SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';

How about adding more normalization check patterns while we are at it?
I have ended up with the attached, all things combined. Diego, what do
you think?
--
Michael

Attachment: 0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch (text/plain, 5.3 KB)
