Re: BUG #19354: JOHAB rejects valid byte sequences

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: assam258(at)gmail(dot)com
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeroen Vermeulen <jtvjtv(at)gmail(dot)com>, VASUKI M <vasukianand0119(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: 2026-04-15 01:49:24
Message-ID: CA+hUKGJMrcS=hBkqVk=5pjM4w8edG=_ArASC82RqB6HQro-v-g@mail.gmail.com
Lists: pgsql-bugs

On Wed, Apr 15, 2026 at 1:20 PM Henson Choi <assam258(at)gmail(dot)com> wrote:
> In short: completion form is a frequency-curated lookup, combinational
> form is an algorithmic composition that covers the full modern Hangul
> space. Unicode later adopted the combinational form's coverage as a
> completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3)
> encodes exactly the same 11,172 modern syllables, as precomposed code
> points. So today the three Korean-related encodings PostgreSQL
> supports sit along this spectrum: EUC_KR (curated completion form),
> UHC (extended completion form), and JOHAB (algorithmic combinational
> form).

Thank you! Yes, that makes total sense. Here are my own notes
(compiled from English-language Wikipedia articles), which say
essentially the same thing + some notes about Hancom:

The Korean writing system:
1. Hanja: Chinese characters used in names, legal and historical
documents, and to disambiguate homonyms. The number of characters in
use is difficult to pin down (as in Japan and China).
2. Hangul: a phonetic system used for almost all modern Korean text.
Hangul characters are composed of 2-5 "jamo", commonly 2-3 in modern
texts, each representing a consonant or vowel.

Character set standards:
1. KS X 1001: 4,888 Hanja (of the vast number of hard-to-count CJK
ideographs) + 2,350 precomposed Hangul (of 11,172 theoretically
possible jamo combinations).
2. KS X 1002: added some more but no one ever implemented it,
possibly because...
3. Unicode: all 11,172 possible precomposed Hangul + individual jamo
for composition + all Hanja/Kanji/Hanzi characters known to humanity
(still growing).

Encodings:
1. EUC-KR, AKA Wansung (= "precomposed"): directly encoded KS X 1001.
2. JOHAB (= "combining"): deferred to KS X 1001 for Hanja, but
described all possible Hangul as jamo stored in bitfields.
3. UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
as a base but supplied all possible precomposed Hangul and 8,222
Hanja (complete CJK as of Unicode 2.0).
4. UTF-8, UTF-16, UTF-32: Unicode.
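As an aside, the arithmetic behind those numbers is easy to sketch.
The Unicode composition formula below is the standard one; the JOHAB
bit layout (MSB set, then three 5-bit jamo fields) is how I read the
annex, so treat the specific index values as illustrative:

```python
# Modern Hangul syllables: 19 leads x 21 vowels x 28 tails (tail 0 = none).
LEADS, VOWELS, TAILS = 19, 21, 28
assert LEADS * VOWELS * TAILS == 11172

def unicode_compose(l, v, t):
    """Unicode precomposed syllable: U+AC00 + (l*21 + v)*28 + t."""
    return 0xAC00 + (l * VOWELS + v) * TAILS + t

print(chr(unicode_compose(0, 0, 0)))     # '가' (U+AC00), first syllable
print(chr(unicode_compose(18, 20, 27)))  # '힣' (U+D7A3), last syllable

def johab_unpack(code):
    """JOHAB packs a syllable as MSB=1, then three 5-bit jamo indices."""
    assert code & 0x8000, "Hangul codes have the high bit set"
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F

# 0x8861 is '가' in JOHAB: lead index 2 (ㄱ), vowel 3 (ㅏ), tail 1 (filler).
print(johab_unpack(0x8861))              # (2, 3, 1)
```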

Realpolitik that fed back into standards:
1. The Hancom "Hangul" word processor used de facto standard JOHAB
encoding, and dominated.
2. KS X 1001 recognised this and added that annex.
3. MS-DOS/Windows recognised this and called it CP1361.
4. MS-DOS/Windows switched to UHC/CP949 alongside Unicode some time
in the early to mid 90s.
5. Hancom switched to Unicode around the turn of the millennium.

I will study your patch and your analysis. It looks good on first read.

> Why keep it rather than remove it
> ---------------------------------
>
> I understand the appeal of simply deleting a dead-looking encoding,
> and Thomas' removal patch is clean work. However, Korean archival
> data from the 1990s (government records, academic repositories, early
> online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
> in PostgreSQL provides a straightforward ingest path
> (client_encoding=JOHAB, convert_from, then store as UTF-8). Once
> removed, that path closes with no obvious alternative short of
> preprocessing outside PostgreSQL. Fixing the verifier preserves the
> capability at the cost of a ~30-line correction plus tests.

The counter-argument would be that you could use iconv
--from-code=JOHAB ..., or libiconv, or the codecs available in Python,
Java, etc. for dealing with historical archived data, something data
archivists must be very aware of. As for old Hancom word processor
files (not really of relevance to PostgreSQL), apparently they can be
imported by modern word processors.
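For what it's worth, Python's bundled "johab" codec (which implements
CP1361) handles such data directly; a quick sketch, with the 0x8861
byte value taken from the CP1361 mapping:

```python
# Round-trip a Hangul string through Python's built-in JOHAB codec (CP1361).
text = "가나다"
johab_bytes = text.encode("johab")
assert johab_bytes.decode("johab") == text

# '가' (U+AC00) encodes as the two bytes 0x88 0x61 under CP1361.
print("가".encode("johab").hex())  # "8861", per the CP1361 mapping
```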

> Happy to iterate on the patch, the commit message, or the tests.
> Thanks to everyone for the careful analysis that preceded this; I
> recognise that the consensus was leaning toward removal, and I would
> appreciate a chance to have this fix considered as an alternative.

Cool. For now I'll leave the removal on ice, and look into committing
your patch. Thanks for working on it!
