| From: | Michael Paquier <michael(at)paquier(dot)xyz> |
|---|---|
| To: | Diego Frias <mail(at)dzfrias(dot)dev> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization |
| Date: | 2026-06-04 04:07:00 |
| Message-ID: | aiD55PbCxUKuScRr@paquier.xyz |
| Lists: | pgsql-hackers |
On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
> In short, TCount actually counts 1 more than the number of T
> syllables; this is so s % TCount == 0 implies that s has no T
> syllable (because the 0th place represents the absence of a T
> syllable), where s is the s-index of a precomposed Hangul
> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
> syllable, the composition algorithm yields a nonsense character when
> 0x11A7 is put in the T position.
Oops. Yes, accepting TBASE itself as a T syllable in the recomposition
is incorrect. I found the passage you are quoting here (TBase is set to one less..):
https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
The character gets eaten by the normalization. Not good.
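To illustrate the off-by-one, here is a minimal sketch of the Hangul composition step as described in the Unicode core specification. It is not the code from the attached patch or from the PostgreSQL tree; the function name hangul_compose is hypothetical, and the constants are the ones given by the specification. The relevant part is the lower bound of the T range, which has to start at TBASE + 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants as given in the Unicode core specification, chapter 3 */
#define SBASE	0xAC00
#define LBASE	0x1100
#define VBASE	0x1161
#define TBASE	0x11A7		/* one less than the first trailing consonant */
#define LCOUNT	19
#define VCOUNT	21
#define TCOUNT	28			/* 27 trailing consonants, plus 1 for "no T" */
#define NCOUNT	(VCOUNT * TCOUNT)
#define SCOUNT	(LCOUNT * NCOUNT)

/*
 * Hypothetical helper: try to compose two code points using the Hangul
 * arithmetic.  Returns true and sets *result on success, false otherwise.
 */
static bool
hangul_compose(uint32_t first, uint32_t second, uint32_t *result)
{
	/* L + V -> LV syllable */
	if (first >= LBASE && first < LBASE + LCOUNT &&
		second >= VBASE && second < VBASE + VCOUNT)
	{
		uint32_t	lindex = first - LBASE;
		uint32_t	vindex = second - VBASE;

		*result = SBASE + (lindex * VCOUNT + vindex) * TCOUNT;
		return true;
	}

	/*
	 * LV + T -> LVT syllable.  The first code point must be an LV syllable
	 * (s-index divisible by TCOUNT), and the second must be strictly greater
	 * than TBASE, because T index 0 means "no trailing consonant".
	 */
	if (first >= SBASE && first < SBASE + SCOUNT &&
		(first - SBASE) % TCOUNT == 0 &&
		second > TBASE && second < TBASE + TCOUNT)
	{
		*result = first + (second - TBASE);
		return true;
	}

	return false;
}
```

With this check, U+AC00 followed by U+11A7 is left uncomposed, while U+AC00 followed by U+11A8 composes to U+AC01. Writing the lower bound as >= TBASE instead would accept U+11A7 with a T index of 0 and return U+AC00 unchanged, so the second code point would be silently dropped.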
> Let me know if this patch needs anything else. I can write a test
> for this, but it looks like the current testing setup in
> src/common/norm_test.c only runs the Unicode test suite and isn’t
> built for writing custom tests. If that is something of interest,
> though, I’m happy to add that to this patch.
We have a set of tests in src/test/regress/sql/unicode.sql that would
fit nicely with what you want to address here. For this specific
problem, this would work:
SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
How about adding more normalization check patterns while we are at it? I
ended up with the attached, all things combined. Diego, what do you
think?
--
Michael
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch | text/plain | 5.3 KB |