| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeroen Vermeulen <jtvjtv(at)gmail(dot)com>, VASUKI M <vasukianand0119(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Re: BUG #19354: JOHAB rejects valid byte sequences |
| Date: | 2026-04-15 04:25:04 |
| Message-ID: | CAAAe_zCwaccH7h+GOtHbo_docCY-o0c5NMRuYkdz15f=KL4f0g@mail.gmail.com |
| Lists: | pgsql-bugs |
>
> 3. UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
> as a base but supplied all possible pre-composed Hangul and 8,222
> Hanja (complete CJK as of Unicode 2.0).
Small correction: UHC's additions over EUC-KR are on the Hangul side,
not Hanja. UHC adds 8,822 pre-composed Hangul (taking Hangul coverage
from EUC-KR's 2,350 up to the full 11,172) and leaves Hanja unchanged
at KS X 1001's 4,888. I enumerated all three encodings against
PostgreSQL's current conversion tables to double-check:
    Encoding   Hangul   Hanja
    EUC_KR      2,350   4,888
    UHC        11,172   4,888
    JOHAB      11,172   4,888   (after this patch)
"Complete CJK as of Unicode 2.0" is off too -- Unicode 2.0's CJK
Unified Ideographs block had roughly 20,900 characters, so UHC and
JOHAB both carry only the KS X 1001 Hanja subset. The 8,222 figure
looks like it got swapped with the 8,822 Hangul number.
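The Hangul figures can be sanity-checked independently of PostgreSQL. The sketch below uses Python's bundled codecs rather than the conversion tables mentioned above, so it is a cross-check and not the enumeration itself. Assumptions: `cp949` is Python's name for UHC, the helper `count_hangul` is made up for illustration, and the two-byte cap is there because a codec may fall back to a longer make-up sequence for syllables outside KS X 1001.

```python
# Modern precomposed Hangul: 19 initials x 21 medials x 28 finals
# (the 28 includes "no final consonant").
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
TOTAL_SYLLABLES = 19 * 21 * 28


def count_hangul(codec: str, max_bytes: int = 2) -> int:
    """Count syllables that `codec` encodes in at most `max_bytes` bytes."""
    n = 0
    for cp in range(HANGUL_FIRST, HANGUL_LAST + 1):
        try:
            if len(chr(cp).encode(codec)) <= max_bytes:
                n += 1
        except UnicodeEncodeError:
            pass  # not representable in this codec
    return n


if __name__ == "__main__":
    print("total syllables:", TOTAL_SYLLABLES)
    for codec in ("euc_kr", "cp949", "johab"):
        print(codec, count_hangul(codec))
```

The difference between the full syllable count and EUC-KR's 2,350 is the 8,822 that UHC adds.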
> Realpolitik that fed back into standards:
> 1. The Hancom "Hangul" word processor used de facto standard JOHAB
> encoding, and dominated.
> 2. KS X 1001 recognised this and added that annex.
Minor nit on the sequence: KS C 5601 already had a combinational annex
in its 1982 revision, but with a different bit layout from the one
Hancom's word processor used. The 1992 revision swapped the annex's
bit layout to the commercial combinational form (상용 조합형) that
the industry -- Hancom included -- had popularised. The KS X
1001:2004 commentary documents this transition explicitly ("비트
조합을 널리 쓰고 있는 이른바 상용 조합형으로 바꿈"). So "KS
recognised the de facto standard" applies to 1992, not to the annex's
first appearance.
Worth mentioning for atmosphere: that period was the tail end of the
Apple II clone / MSX era and the rise of IBM PC compatibles in Korea,
and contemporary Korean computer magazines carried running debates over
Wansung vs Johab on three axes at once -- the encoding, the keyboard
layout (두벌식 vs 세벌식, the Korean QWERTY-vs-Dvorak argument), and
the font rendering strategy (per-syllable bitmap tables for Wansung
vs jamo-composition for Johab) -- right alongside their game reviews.
The 1992 annex revision landed in the middle of that churn, not
ahead of it.
One further observation that fits your KS X 1002 note. EUC-KR isn't
really a single standard but a layered stack -- KS X 1001 (the
character set) + ISO/IEC 2022 (the code-extension skeleton) + the
AT&T-era EUC convention of pinning G0 to ASCII and G1 to the 8-bit
region, later formalised in Korea as KS X 2901. That informal
layering is precisely what let UHC land so easily: Microsoft extended
the same 8-bit region with additional Hangul, and every EUC-KR
decoder silently kept working for the covered subset.
KS X 1002 tried the opposite approach -- a formally separated
supplementary set, designated via a distinct ISO-2022 escape
sequence. The design was cleaner on paper but required every
consumer to implement set-switching for a supplementary character
range that nobody was motivated to support. UHC sidestepped this
entirely by just filling in the unused 8-bit slots. So the
structural reason 1002 lost to UHC isn't just market power; it is
that UHC matched EUC-KR's informal extensibility while 1002 demanded
strict ISO-2022 compliance. JOHAB (Annex 3) sits at the other end of
that spectrum -- a self-contained spec where a single document nails
down character set, byte layout, and composition algorithm, which is
what makes the verifier fix tractable.
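To make the "byte layout plus composition algorithm" point concrete, here is a minimal sketch of how a precomposed syllable maps to a two-byte JOHAB code: a set high bit followed by three 5-bit fields for initial, medial, and final. The field-value tables are the ones commonly tabulated for JOHAB and are an assumption here, not quoted from Annex 3 or from the patch; `johab_encode_syllable` is a hypothetical name, and the sketch covers Hangul syllables only (no Hanja, symbols, or standalone jamo).

```python
# 5-bit field values per jamo index (assumed; see note above).
INITIAL = list(range(2, 21))                        # 19 initial consonants
MEDIAL = [3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15,
          18, 19, 20, 21, 22, 23, 26, 27, 28, 29]   # 21 vowels
FINAL = list(range(1, 18)) + list(range(19, 30))    # 28 finals; index 0 = none


def johab_encode_syllable(ch: str) -> bytes:
    """Encode one precomposed Hangul syllable as a 2-byte JOHAB code."""
    idx = ord(ch) - 0xAC00
    if not 0 <= idx < 19 * 21 * 28:
        raise ValueError("not a precomposed Hangul syllable")
    # Undo Unicode's arithmetic layout: idx = (lead * 21 + vowel) * 28 + tail
    lead, rest = divmod(idx, 21 * 28)
    vowel, tail = divmod(rest, 28)
    code = 0x8000 | (INITIAL[lead] << 10) | (MEDIAL[vowel] << 5) | FINAL[tail]
    return code.to_bytes(2, "big")


if __name__ == "__main__":
    for ch in "한글":
        print(ch, johab_encode_syllable(ch).hex(), ch.encode("johab").hex())
```

The `__main__` block prints the sketch's result next to Python's own `johab` codec so the two can be compared by eye.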
A small downstream consequence of UHC's slot-filling approach is that
byte-wise comparison no longer matches Korean dictionary order: the
8,822 added Hangul land in the low 0x81-0xA0 range, ahead of the
ganada-ordered EUC-KR region. Unicode's Hangul Syllables block
(U+AC00-U+D7A3) later restored that by assigning all 11,172 syllables
algorithmically in ganada order, so UTF-8 memcmp once again
produces Korean lexicographic order -- one of the quieter practical
drivers of Korea's Unicode migration.
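A small illustration of that ordering effect, using Python's `cp949` codec as a stand-in for UHC (an assumption; PostgreSQL's own tables are not involved). The sample picks 갂 (U+AC02), a syllable outside KS X 1001, which UHC places in its extension area below the EUC-KR Hangul rows.

```python
# Four syllables listed in code point order, which is also dictionary order.
syllables = ["가", "각", "갂", "간"]

# Sorting by encoded bytes mimics what a memcmp-style comparison would do.
by_uhc = sorted(syllables, key=lambda s: s.encode("cp949"))
by_utf8 = sorted(syllables, key=lambda s: s.encode("utf-8"))

if __name__ == "__main__":
    print("UHC byte order:  ", by_uhc)
    print("UTF-8 byte order:", by_utf8)
```

Under UHC bytes the extension syllable sorts ahead of 가, while the UTF-8 byte order keeps the list as given.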
Credit where it's due on that outcome: getting all 11,172 precomposed
Hangul into the BMP in algorithmic ganada order (the "Korean
Hangul Mess" cleanup in Unicode 2.0, 1996) wasn't inevitable.
Engineers at Microsoft's Korean office were notable advocates for
that arrangement alongside Korean standards-body contributors and
other vendors, and the Korean computing world has been quietly
benefiting from it ever since. It's a nice detail given who's
reading this thread.
Everything else in the summary matches what I had -- thanks for the
independent write-up, and for taking another look at the patch.
> > The counter argument would be that you could use iconv
> > --from-code=JOHAB ..., or libiconv, or the codecs available in Python,
> > Java, etc for dealing with historical archived data, something that
> > data archivists must be very aware of.
>
> Sure. But it's not comfortable to remove a user-visible feature
> we've had for decades. My own primary concern about it was that a
> correct fix could require non-backwards-compatible behavior changes.
> Henson's analysis says that that's not a problem. So assuming this
> patch withstands review, I'd be much happier to see it applied than
> to remove JOHAB.
Thank you -- the backward-compat angle was the hinge I was hoping
would carry, and I'm glad the analysis held up. On the size of the
remaining audience: niche Korean standards have a small but stubborn
user base, much the way Dvorak users persist in the West. There are
still 세벌식 (Sebeolsik) keyboard users in Korea who keep hand-cut
stickers over their QWERTY-printed keycaps rather than switch back;
the JOHAB data holdouts are that kind of tail -- vanishingly small in
absolute numbers, but without a graceful alternative if we close the
door. A correctly-working JOHAB serves that tail at near-zero
ongoing cost, which is ultimately what the patch is arguing for.
> No opinion at the moment about whether to back-patch.
Happy to defer on back-patching. The behaviour change is strictly
additive (previously rejected sequences are now accepted; nothing is
reinterpreted), so the back branches are technically safe, but
v19-only is a perfectly reasonable policy call if the project prefers
minimum surface area on the first cycle.
If you do want back-patches, I'm happy to produce per-branch
versions. Given how long the JOHAB code has been stable (as noted
earlier in the thread), my feeling is that the same patch should
apply cleanly down to PG 14 without modification. Happy to verify
that and post the set if it would help.
One personal aside: reading KS X 1001 Annex 3 end-to-end for this fix
turned out to be an unexpectedly cheerful detour -- it felt a bit
like cracking open a 6502 assembly reference from roughly the same
era. Back then I also had a popular neural-networks book that
convinced teenage-me computers would never approach human cognition
because they could never match the brain's memory scale -- a
prediction that, looking around in 2026, has aged about as well as
you'd expect. Thanks to everyone on the thread for making that
side-quest worthwhile.
Regards,
Henson