Re: BUG #19354: JOHAB rejects valid byte sequences

From: Henson Choi <assam258(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeroen Vermeulen <jtvjtv(at)gmail(dot)com>, VASUKI M <vasukianand0119(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: 2026-04-15 01:20:06
Message-ID: CAAAe_zCLVunjt1u+2E86shwc3hk1x4bzUyU86nY1fq-nAVYN0Q@mail.gmail.com
Lists: pgsql-bugs

Hi hackers,

> > So +1 from me, set the phasers to git rm.
>
> +1
>
> > Wait until 20, or just do it now?
> Let's just do it now.
>

This follows up on my earlier note with a review of the primary
Korean national standard and a fix patch. The fix turns out to be
small, and I believe it resolves the ambiguity that drove the removal
proposal.

Standard reference
------------------

The authoritative specification for JOHAB is Annex 3 of KS X 1001
(originally KS C 5601-1992 Annex 3, renumbered KS X 1001:1992 and
republished as KS X 1001:2004), published by the Korean Agency for
Technology and Standards (KATS) and available from the national
e-standards portal:

https://standard.go.kr/KSCI/api/std/viewMachine.do?reformNo=08&tmprKsNo=KSX1001&formType=STD

The decisive passages are quoted below in the original Korean with an
English translation, so non-Korean readers can verify the byte ranges
the fix implements.

Two terms from the standard recur throughout the quoted passages:

* 완성형 부호계 (romanised "WANSUNG", literally "completion-form
  code set"). Each Hangul syllable is assigned a single code point
  drawn from a fixed table of pre-composed syllables. The main
  body of KS X 1001 defines such a table of 2,350 syllables; per
  the standard's commentary, that subset was chosen by frequency
  analysis over samples from publishing, print media, industry,
  academia and dictionaries at the time of the 1987 revision,
  which is why some valid modern syllables (e.g. 뢔, 쌰, 쎼, 쓔,
  쬬) were deliberately excluded. EUC-KR is the packed 8-bit form
  of that WANSUNG table, and Microsoft's CP949 / UHC is a later
  superset that fills in additional syllables.

* 조합형 부호계 (romanised "JOHAB", literally "combinational code
  set"). Each Hangul syllable is constructed at encoding time
  from 5-bit codes for the initial consonant, medial vowel, and
  final consonant packed into two bytes, so all 11,172 modern
  syllables are directly representable without a lookup table.
  This is what Annex 3 defines and what PostgreSQL ships under
  the encoding name JOHAB.

In short: completion form is a frequency-curated lookup, combinational
form is an algorithmic composition that covers the full modern Hangul
space. Unicode later adopted the combinational form's coverage as a
completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3)
encodes exactly the same 11,172 modern syllables, as precomposed code
points. So today the three Korean-related encodings PostgreSQL
supports sit along this spectrum: EUC_KR (curated completion form),
UHC (extended completion form), and JOHAB (algorithmic combinational
form).
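
The 11,172 figure follows from the jamo counts in the annex quoted
below: 19 initials x 21 medials x 28 final slots (the 27 finals plus
"no final"). A minimal C sketch of that arithmetic, using the
syllable-composition formula from the Unicode Standard; the function
name hangul_syllable_codepoint() and the enum names are made up for
illustration and are not PostgreSQL code:

```c
#include <stdint.h>

enum
{
    HANGUL_BASE = 0xAC00,   /* first code point of the Hangul Syllables block */
    N_INITIAL = 19,         /* initial consonants */
    N_MEDIAL = 21,          /* medial vowels */
    N_FINAL = 28            /* 27 final consonants plus the "no final" slot */
};

/*
 * Compose a precomposed syllable from zero-based jamo indices in Unicode
 * order.  The caller is assumed to pass indices within range.
 */
static uint32_t
hangul_syllable_codepoint(int initial, int medial, int final)
{
    return HANGUL_BASE +
        (uint32_t) ((initial * N_MEDIAL + medial) * N_FINAL + final);
}
```

With the largest indices (18, 20, 27) the formula lands on U+D7A3, the
last syllable in the range given above.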

부속서 3 보조 부호계 (2바이트 조합형 부호계)
[Annex 3. Supplementary code set (two-byte combinational code)]

1. 적용 범위
[Scope]

이 부속서에서는 기본 부호계인 2바이트 완성형 부호계의 보조 부호계로서,
2바이트 조합형 부호계를 규정한다.
[This annex specifies the two-byte combinational code set as the
supplementary code set to the two-byte completion-form code set that
constitutes the main body of the standard.]

2. 도형 문자
[Graphic characters]

a) 한 글
[Hangul]
부속서 3 표 2에 규정된 첫소리 글자 19자, 가운뎃소리 글자 21자,
끝소리 글자 27자로 조합 가능한, 모든 현대 한글 글자 마디(11 172자)
및 현대 한글 낱자(67자)
[All modern Hangul syllables (11,172) and modern Hangul jamo (67)
that can be composed from the 19 initials, 21 medials, and 27
finals defined in Annex 3 Table 2.]
b) 한 자
[Hanja]
2바이트 완성형 부호계에서 규정한 한자(4 888자)
[The 4,888 Hanja defined in the two-byte completion-form code
set.]
c) 그 밖의 문자
[Other characters]
2바이트 완성형 부호계에서 규정한 문자 중에서 현대 한글 글자 마디
및 현대 한글 낱자, 한자를 제외한 도형 문자(937자)
[The 937 graphic characters defined in the completion-form code
set other than modern Hangul syllables, modern Hangul jamo, and
Hanja.]

3. 도형 문자의 배치 영역
[Graphic-character placement]

도형 문자의 배치 영역은 부속서 3 표 1과 같다.
[The placement of the graphic characters is given in Annex 3
Table 1.]

부속서 3 표 1 도형 문자의 배치 영역
[Annex 3 Table 1. Placement of graphic characters]

구 분                 첫째 바이트    둘째 바이트
[Category]            [Lead byte]    [Trail byte]
--------------------  -------------  ------------------
한글 글자마디         84H–D3H        41H–7EH, 81H–FEH
[Hangul syllables]
사용자 정의 영역      D8H            31H–7EH, 91H–FEH
[User-defined area]
기타 문자             D9H–DEH        31H–7EH, 91H–FEH
[Other characters]
한 자                 E0H–F9H        31H–7EH, 91H–FEH
[Hanja]

비 고 16진수를 나타내기 위하여 맨 뒤에 H를 적는다
(10 H는 10진법으로 16이다).
[Note: a trailing H denotes a hexadecimal value
(e.g. 10H equals 16 in decimal).]

4. 한글 글자 마디의 부호값 구성 및 배열
[Encoding and layout of Hangul syllables]

각 한글 글자 마디의 부호값은 2바이트 내에 첫소리 글자 5비트,
가운뎃소리 글자 5비트, 끝소리 글자 5비트로 하여, 한글 낱자를 조합하여
표현한 값으로 정의한다. 각 한글 낱자의 순서는 최상위 비트(MSB)를 1로
하고 나서 첫소리, 가운뎃소리, 끝소리 글자가 순서대로 나오도록
구성한다.
[The code value of each Hangul syllable is defined as the composition
of the Hangul letters within two bytes: 5 bits for the initial
consonant, 5 bits for the medial vowel, and 5 bits for the final
consonant, laid out with the most-significant bit set to 1 followed
by the initial, medial, and final in that order.]
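
To make the bit layout concrete, here is a small C sketch that splits
a two-byte Hangul code value into its three 5-bit fields. The struct
and function names are hypothetical, and the sketch stops at the raw
field values: turning them into jamo requires Table 2, which is not
quoted here.

```c
#include <stdint.h>

/* Raw 5-bit field values of a JOHAB Hangul code (hypothetical type). */
typedef struct
{
    unsigned    initial;    /* bits 14..10 */
    unsigned    medial;     /* bits 9..5 */
    unsigned    final;      /* bits 4..0 */
} johab_fields;

/*
 * Split lead/trail bytes into the three fields.  Bit 15 is the MSB that the
 * standard fixes at 1; it is not checked here.
 */
static johab_fields
johab_split(unsigned char lead, unsigned char trail)
{
    uint16_t    v = (uint16_t) ((lead << 8) | trail);
    johab_fields f;

    f.initial = (v >> 10) & 0x1F;
    f.medial = (v >> 5) & 0x1F;
    f.final = v & 0x1F;
    return f;
}
```

For the byte pair 0x8A 0x5C from the bug report this yields the field
values 2, 18 and 28.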

Annex 3 continues with Table 2 (5-bit jamo codes), Table 3 (row-wise
mapping between completion-form and combinational-form for Hanja and
other characters), and usage notes. Those are not needed for the
verifier fix, but they do confirm that the mapping tables we already
ship in johab_to_utf8.map line up with the standard; the same is true
of the data under unicode.org's JOHAB.TXT that Robert pointed to
earlier in the thread.

On "multiple variants": the KS national standard for JOHAB (Annex 3)
is singular and authoritative, and the mapping tables we ship match
it. The Wikipedia note about EBCDIC-based and stateful JOHAB variants
refers to niche vendor encodings that PostgreSQL never implemented.

The historical "variant" churn in Korean encoding is in fact not about
JOHAB but about the completion-form main body of KS X 1001 and its
packed form EUC-KR: Microsoft's CP949 / UHC extended WANSUNG with
additional Hangul syllables, and different vendors disagreed at the
edges. PostgreSQL already separates those concerns by carrying
EUC_KR and UHC as distinct encodings, so fixing JOHAB does not
re-open that family of ambiguities.

Diagnosis
---------

pg_johab_mblen() in src/common/wchar.c delegates to pg_euc_mblen(),
whose relevant branches treat 0x8F (EUC's SS3) as a 3-byte prefix and
any other high-bit byte as a 2-byte prefix. pg_johab_verifychar()
then requires each trail byte to satisfy IS_EUC_RANGE_VALID(), defined
in the same file as ((c) >= 0xa1 && (c) <= 0xfe). Neither rule
corresponds to the standard:

* JOHAB has no three-byte sequences. 0x8F is simply a valid Hangul
  lead byte (it lies in the 0x84-0xD3 Hangul syllable range from
  Table 1) that begins a normal 2-byte sequence; EUC's SS3 handling
  spuriously inflates its length to 3.
* Hangul trail bytes are 0x41-0x7E or 0x81-0xFE; the other three
  categories use 0x31-0x7E or 0x91-0xFE. Restricting trail bytes to
  0xA1-0xFE rejects large portions of the code space the standard
  defines, including the sequences in the bug report. 0x5C (ASCII
  backslash) is a valid Hangul trail byte, which is exactly what
  Jeroen's unit test surfaced.
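
The mismatch can be reproduced in isolation. The C sketch below is a
simplified stand-in for the two EUC rules described above, not a copy
of wchar.c: IS_EUC_RANGE_VALID() is defined as quoted, while
euc_style_mblen() is a hypothetical name that models only the 0x8F
and high-bit branches.

```c
#define SS3 0x8f    /* EUC single-shift 3 */
#define IS_EUC_RANGE_VALID(c) ((c) >= 0xa1 && (c) <= 0xfe)

/*
 * Simplified model of the EUC length rule: SS3 starts a 3-byte sequence,
 * any other high-bit byte starts a 2-byte sequence, ASCII is 1 byte.
 */
static int
euc_style_mblen(const unsigned char *s)
{
    if (*s == SS3)
        return 3;
    if (*s & 0x80)
        return 2;
    return 1;
}
```

Under these rules the lead byte 0x8F is reported as 3 bytes long, and
the trail bytes 0x5B, 0x5C and 0x5D from the bug report fail
IS_EUC_RANGE_VALID() even though Table 1 allows them after a Hangul
lead byte.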

The consequence is that a substantial portion of johab_to_utf8.map is
unreachable today: the verifier rejects the byte sequences before
conversion is attempted. That matches Robert's observation that the
"right" mapping existed but was gated behind an incorrect rule.

Patch
-----

The attached 0001-Fix-JOHAB-encoding-validation.txt makes these
changes:

src/common/wchar.c
    Rewrite pg_johab_mblen() to return 2 when the lead byte falls in
    any of the ranges listed in Annex 3 Table 1, and 1 otherwise
    (ASCII pass-through). Rewrite pg_johab_verifychar() to apply the
    correct trail-byte range depending on whether the lead byte is a
    Hangul lead byte (trail 0x41-0x7E or 0x81-0xFE) or a non-Hangul
    lead byte (trail 0x31-0x7E or 0x91-0xFE). Two helper macros
    IS_JOHAB_LEAD_HANGUL() and IS_JOHAB_LEAD_OTHER() express the
    lead-byte classification once and are shared between mblen and
    verifychar. A comment block above the implementation reproduces
    Table 1 for future maintainers. Also correct
    pg_wchar_table[PG_JOHAB].maxmblen from 3 to 2 so that callers
    sizing buffers from maxmblen do not over-allocate and so that the
    value matches the spec.

doc/src/sgml/charset.sgml
    Update the JOHAB row in the character-set table to show the
    maximum character length as 1-2 instead of 1-3, matching the
    standard and the corrected maxmblen.

src/test/regress/sql/johab.sql
src/test/regress/expected/johab.out
src/test/regress/expected/johab_1.out
src/test/regress/parallel_schedule
    A new regression test, modelled on euc_kr.sql, that runs in UTF8
    databases and skips otherwise. It covers:

    - the original bug sequences \x8A\x5B, \x8A\x5C, \x8A\x5D
      decoding to 굍, 굎, 굏;
    - the first multibyte character from JOHAB.TXT (\x84\x44 -> ㄳ),
      previously rejected;
    - byte sequences that already decoded under the old rules
      (\x89\xEF -> 괦, \x89\xA1 -> 고) to guard against regression;
    - Hanja trail bytes that used to be rejected (\xE0\x31,
      \xE0\x7E, \xE0\x91);
    - one representative of the "other characters" category
      (\xD9\x31);
    - each lead-byte gap (0x80, 0xD5, 0xDF, 0xFA) producing an
      "invalid byte sequence" error;
    - every trail-byte gap for both Hangul (0x40, 0x7F, 0x80) and
      the non-Hangul categories (0x30, 0x7F, 0x90, 0xFF);
    - an incomplete trailing byte for a valid lead byte.
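
For readers who do not want to open the attachment, the rules
described for src/common/wchar.c can be sketched in C roughly as
follows. This is an illustration, not the attached patch: the two
macro names come from the description above, but their bodies are
reconstructed from Table 1, the *_sketch function names and
johab_trail_ok() are hypothetical, and rejecting a non-ASCII byte
that is not a lead byte (returning -1) is an assumption inferred from
the lead-byte-gap test cases.

```c
#include <stdbool.h>

/* Lead-byte classes from Annex 3 Table 1 (macro bodies are assumed). */
#define IS_JOHAB_LEAD_HANGUL(c) ((c) >= 0x84 && (c) <= 0xd3)
#define IS_JOHAB_LEAD_OTHER(c) \
    (((c) >= 0xd8 && (c) <= 0xde) || ((c) >= 0xe0 && (c) <= 0xf9))

/* 2 for any Table 1 lead byte, 1 otherwise. */
static int
johab_mblen_sketch(const unsigned char *s)
{
    return (IS_JOHAB_LEAD_HANGUL(*s) || IS_JOHAB_LEAD_OTHER(*s)) ? 2 : 1;
}

/* Trail-byte range depends on the lead-byte class. */
static bool
johab_trail_ok(unsigned char lead, unsigned char trail)
{
    if (IS_JOHAB_LEAD_HANGUL(lead))
        return (trail >= 0x41 && trail <= 0x7e) ||
            (trail >= 0x81 && trail <= 0xfe);
    return (trail >= 0x31 && trail <= 0x7e) ||
        (trail >= 0x91 && trail <= 0xfe);
}

/* Returns the character length in bytes, or -1 if the input is invalid. */
static int
johab_verifychar_sketch(const unsigned char *s, int len)
{
    if (len < 1)
        return -1;
    if (s[0] < 0x80)
        return 1;               /* ASCII pass-through */
    if (johab_mblen_sketch(s) != 2)
        return -1;              /* high-bit byte in a lead-byte gap */
    if (len < 2)
        return -1;              /* incomplete trailing byte */
    return johab_trail_ok(s[0], s[1]) ? 2 : -1;
}
```

Keeping the lead-byte classification in the two macros means the
length function and the verifier cannot drift apart, which is the
point of sharing them.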

Compatibility
-------------

The mapping tables themselves are unchanged. Byte sequences that
decode successfully today continue to decode to the same characters;
the change is strictly additive in that previously-rejected sequences
now succeed. Because JOHAB is a client-only encoding there is no
on-disk representation to reconcile, so back-branch behaviour would
move from a strict subset of valid JOHAB to full valid JOHAB, without
reinterpreting any byte sequence that was previously accepted. I
believe that is safe to back-patch, but confining the change to v19
is also entirely reasonable if the project prefers to limit the
exposure.

Why keep it rather than remove it
---------------------------------

I understand the appeal of simply deleting a dead-looking encoding,
and Thomas' removal patch is clean work. However, Korean archival
data from the 1990s (government records, academic repositories, early
online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
in PostgreSQL provides a straightforward ingest path
(client_encoding=JOHAB, convert_from, then store as UTF-8). Once
removed, that path closes with no obvious alternative short of
preprocessing outside PostgreSQL. Fixing the verifier preserves the
capability at the cost of a ~30-line correction plus tests.

Happy to iterate on the patch, the commit message, or the tests.
Thanks to everyone for the careful analysis that preceded this; I
recognise that the consensus was leaning toward removal, and I would
appreciate a chance to have this fix considered as an alternative.

Regards,
Henson

Attachment Content-Type Size
0001-Fix-JOHAB-encoding-validation.txt text/plain 15.0 KB
