| From: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
|---|---|
| To: | assam258(at)gmail(dot)com |
| Cc: | hlinnaka(at)iki(dot)fi, thomas(dot)munro(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, jtvjtv(at)gmail(dot)com, vasukianand0119(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Re: BUG #19354: JOHAB rejects valid byte sequences |
| Date: | 2026-04-16 04:53:42 |
| Message-ID: | 20260416.135342.2217670018973462320.ishii@postgresql.org |
| Lists: | pgsql-bugs |
Hi Henson,
Thank you for the patch!
> Diagnosis
> ---------
>
>
> pg_johab_mblen() in src/common/wchar.c delegates to pg_euc_mblen(),
> whose relevant branches treat 0x8F (EUC's SS3) as a 3-byte prefix and
> any other high-bit byte as a 2-byte prefix. pg_johab_verifychar()
> then requires each trail byte to satisfy IS_EUC_RANGE_VALID(), defined
> in the same file as ((c) >= 0xa1 && (c) <= 0xfe). Neither rule
> corresponds to the standard:
>
>
> * JOHAB has no three-byte sequences. 0x8F is simply a valid Hangul
> lead byte (it lies in the 0x84-0xD3 Hangul syllable range from
> Table 1) that begins a normal 2-byte sequence; EUC's SS3 handling
> spuriously inflates its length to 3.
> * Hangul trail bytes are 0x41-0x7E or 0x81-0xFE; the other three
> categories use 0x31-0x7E or 0x91-0xFE. Restricting trail bytes to
> 0xA1-0xFE rejects large portions of the standard, including the
> sequences in the bug report. 0x5C (ASCII backslash) is a valid
> Hangul trail byte, which is exactly what Jeroen's unit test
> surfaced.
From what he showed in the post, I think the analysis is correct.
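To make the mismatch concrete, here is a minimal sketch of the
EUC-style rules as described in the diagnosis. It is not the code in
src/common/wchar.c: the names euc_style_mblen(), euc_style_trail_ok()
and euc_style_demo() are hypothetical, and only the branches mentioned
above are modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the EUC-style length rule described above. */
int
euc_style_mblen(unsigned char lead)
{
	if (lead == 0x8F)		/* EUC's SS3: treated as a 3-byte prefix */
		return 3;
	if (lead & 0x80)		/* any other high-bit byte: 2-byte prefix */
		return 2;
	return 1;				/* ASCII */
}

/* Same range as IS_EUC_RANGE_VALID(): 0xA1 - 0xFE. */
bool
euc_style_trail_ok(unsigned char c)
{
	return c >= 0xA1 && c <= 0xFE;
}

/* Illustrates the two symptoms named in the diagnosis. */
void
euc_style_demo(void)
{
	/* 0x8F is a Hangul lead byte in JOHAB, so the length should be 2. */
	assert(euc_style_mblen(0x8F) == 3);

	/* 0x5C is a valid Hangul trail byte, yet the EUC range rejects it. */
	assert(!euc_style_trail_ok(0x5C));
}
```

Under these rules a sequence such as 0x8F 0x5C is both mis-measured
and rejected, which matches the behaviour in the report.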
> Patch
> -----
>
>
> The attached 0001-Fix-JOHAB-encoding-validation.txt makes these
> changes:
The patch looks good to me. The regression tests also passed here.
> Compatibility
> -------------
>
>
> The mapping tables themselves are unchanged. Byte sequences that
> decode successfully today continue to decode to the same characters;
> the change is strictly additive in that previously-rejected sequences
> now succeed. Because JOHAB is a client-only encoding there is no
> on-disk representation to reconcile, so back-branch behaviour would
> move from a strict subset of valid JOHAB to full valid JOHAB, without
> reinterpreting any byte sequence that was previously accepted. I
> believe that is safe to back-patch, but confining the change to v19
> is also entirely reasonable if the project prefers to limit the
> exposure.
* Category              Lead byte     Trail byte
* --------------------  -----------   ------------------------
* Hangul syllables      0x84 - 0xD3   0x41 - 0x7E, 0x81 - 0xFE
* User-defined area A   0xD8          0x31 - 0x7E, 0x91 - 0xFE
* Other characters      0xD9 - 0xDE   0x31 - 0x7E, 0x91 - 0xFE
* Hanja                 0xE0 - 0xF9   0x31 - 0x7E, 0x91 - 0xFE
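As a rough illustration of how the table above translates into a
check, here is a minimal sketch. It is not the patch itself: the names
johab_sketch_mblen(), johab_sketch_pair_ok() and
johab_sketch_selfcheck() are hypothetical, and treating every high-bit
byte as the start of a 2-byte sequence is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * JOHAB has only 1-byte (ASCII) and 2-byte sequences.  Assumption of this
 * sketch: any high-bit byte starts a 2-byte sequence; whether the pair is
 * valid is decided separately below.
 */
int
johab_sketch_mblen(unsigned char lead)
{
	return (lead & 0x80) ? 2 : 1;
}

/* Validate a 2-byte sequence against the lead/trail ranges in the table. */
bool
johab_sketch_pair_ok(unsigned char lead, unsigned char trail)
{
	if (lead >= 0x84 && lead <= 0xD3)		/* Hangul syllables */
		return (trail >= 0x41 && trail <= 0x7E) ||
			(trail >= 0x81 && trail <= 0xFE);

	if (lead == 0xD8 ||						/* user-defined area A */
		(lead >= 0xD9 && lead <= 0xDE) ||	/* other characters */
		(lead >= 0xE0 && lead <= 0xF9))		/* Hanja */
		return (trail >= 0x31 && trail <= 0x7E) ||
			(trail >= 0x91 && trail <= 0xFE);

	return false;			/* lead byte is not in any category */
}

/* A few spot checks taken directly from the table. */
void
johab_sketch_selfcheck(void)
{
	assert(johab_sketch_mblen(0x8F) == 2);			/* no 3-byte form */
	assert(johab_sketch_pair_ok(0x8F, 0x5C));		/* Hangul, trail 0x5C */
	assert(!johab_sketch_pair_ok(0x84, 0x31));		/* below Hangul trail range */
	assert(johab_sketch_pair_ok(0xE0, 0x31));		/* Hanja, lowest trail */
}
```

With these rules 0x8F 0x5C is a valid 2-byte Hangul sequence, while a
lead byte outside the four categories (for example 0xD4) is rejected.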
The current JOHAB verify function accepts byte sequences that fall
into one of these three categories (ASCII aside):
(2-byte): SS2(0x8E) + 0xA1 - 0xDF
(2-byte): 0xA1 - 0xFE + 0xA1 - 0xFE
(3-byte): SS3(0x8F) + 0xA1 - 0xFE + 0xA1 - 0xFE
The 2-byte sequences fall into one of the JOHAB categories above. The
3-byte sequences may fall into one of the JOHAB categories if the
subsequent byte (the 4th byte) happens to be in the ASCII range.
Otherwise, they are rejected during conversion to UTF-8, before the
data is stored in the database.
Although the current JOHAB verify function is wrong, all byte
sequences that it has already accepted are also within the valid JOHAB
range, as Henson said. This means that an existing UTF-8 database
populated with data while client_encoding was set to JOHAB can safely
be used after patching.
> Why keep it rather than remove it
> ---------------------------------
>
>
> I understand the appeal of simply deleting a dead-looking encoding,
> and Thomas' removal patch is clean work. However, Korean archival
> data from the 1990s (government records, academic repositories, early
> online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
> in PostgreSQL provides a straightforward ingest path
> (client_encoding=JOHAB, convert_from, then store as UTF-8). Once
> removed, that path closes with no obvious alternative short of
> preprocessing outside PostgreSQL. Fixing the verifier preserves the
> capability at the cost of a ~30-line correction plus tests.
+1.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp