Remove invalid SS2/SS3 handling from EUC-KR routines

From: SungJun Jang <sjjang112233(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, assam258(at)gmail(dot)com, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com
Subject: Remove invalid SS2/SS3 handling from EUC-KR routines
Date: 2026-05-12 06:09:49
Message-ID: CAE+cgNgTWvCT2+HZYRzQA8-wSQrj-FjPQQNffn=_3DOpz0pKgA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
(ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.

PostgreSQL currently has two inconsistencies with this:

1. Table 23.3 in the documentation lists EUC_KR Bytes/Char as "1-3".
2. pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len()
delegate to the shared pg_euc_* helpers, which include SS2 (0x8E)
and SS3 (0x8F) handling written for encodings that designate G2/G3
(e.g. EUC-JP, EUC-TW).
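
To make the second point concrete, here is a rough sketch of the kind of length rule a shared EUC helper applies. The function name and macros are hypothetical stand-ins written for illustration, not a copy of the PostgreSQL source; the point is only that an SS3 lead byte yields a length of 3, which EUC-KR can never need.

```c
#define SS2 0x8e				/* single shift 2 */
#define SS3 0x8f				/* single shift 3 */

/*
 * Hypothetical stand-in for a shared EUC length rule (illustration only).
 * SS2 and SS3 are treated as lead bytes of G2/G3 characters.
 */
static int
euc_shared_mblen_sketch(const unsigned char *s)
{
	if (*s == SS2)
		return 2;				/* SS2 + one byte (G2) */
	if (*s == SS3)
		return 3;				/* SS3 + two bytes (G3) */
	if (*s & 0x80)
		return 2;				/* G1: two high-bit bytes */
	return 1;					/* G0: ASCII */
}
```

Under this rule a 0x8F byte reports a 3-byte character even though no such character exists in EUC-KR.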

The following evidence confirms that SS2/SS3 are not part of EUC-KR:

- KS X 2901 defines EUC-KR with the following code set table
(see attached ksx2901-euc-kr-code-set-table.png):

  Code set  Code value representation        Character set
  0         0XXXXXXX                         KS X 1003 (ASCII)
  1         1XXXXXXX 1XXXXXXX                KS X 1001
  2         SS2 1XXXXXXX [1XXXXXXX [...]]    undefined
  3         SS3 1XXXXXXX [1XXXXXXX [...]]    undefined

The standard states: "In particular, since the character sets for
code set 2 and code set 3 are not defined, they may be defined and
used in the future if necessary."

- pg_euckr_verifychar() (src/common/wchar.c:1044) already has no SS2/SS3
branch; it accepts only 0x00-0x7F (G0, ASCII) and 0xA1-0xFE lead bytes
(G1, KS X 1001). Any 0x8E or 0x8F byte is rejected.
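
The acceptance rule described in the last bullet can be sketched as follows. This is a simplified illustration with hypothetical names, not the actual body of pg_euckr_verifychar(); it assumes that both bytes of a G1 character must fall in 0xA1-0xFE.

```c
#define EUCKR_RANGE_VALID(c) ((c) >= 0xa1 && (c) <= 0xfe)

/*
 * Hypothetical sketch: return the character length (1 or 2) if the bytes
 * at s form a valid EUC-KR character, or -1 otherwise.
 */
static int
euckr_verifychar_sketch(const unsigned char *s, int len)
{
	if (len < 1)
		return -1;
	if ((s[0] & 0x80) == 0)
		return 1;				/* G0: ASCII */
	if (len < 2)
		return -1;				/* truncated 2-byte character */
	if (!EUCKR_RANGE_VALID(s[0]) || !EUCKR_RANGE_VALID(s[1]))
		return -1;				/* rejects 0x8E (SS2) and 0x8F (SS3) too */
	return 2;					/* G1: KS X 1001 */
}
```

Because 0x8E and 0x8F lie below 0xA1, they fail the range check without needing a dedicated SS2/SS3 branch.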

This patch fixes both inconsistencies:

- Replace the three delegating functions with EUC-KR-specific
implementations that recognise only G0 (1 byte) and G1 (2 bytes).
- Change maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
- Correct Table 23.3 from "1-3" to "1-2".
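
For illustration only, an EUC-KR-specific length rule of the kind described in the first bullet might look like the sketch below. The names are hypothetical and this is not the code from the attached patch; it just shows that with only G0 and G1 recognised, the reported length can never exceed 2.

```c
#include <stddef.h>

/* Hypothetical EUC-KR-only length rule: G0 is 1 byte, G1 is 2 bytes. */
static int
euckr_mblen_sketch(const unsigned char *s)
{
	return (*s & 0x80) ? 2 : 1;
}

/*
 * Count characters in a buffer that is assumed to hold already-validated
 * EUC-KR bytes, stepping by the length rule above.
 */
static size_t
euckr_count_chars_sketch(const unsigned char *s, size_t len)
{
	size_t		n = 0;

	while (len > 0)
	{
		size_t		l = (size_t) euckr_mblen_sketch(s);

		if (l > len)
			l = len;			/* truncated trailing character: stop at end */
		s += l;
		len -= l;
		n++;
	}
	return n;
}
```

With this rule, a maximum character length of 2 is sufficient for any buffer size calculation based on maxmblen.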

Because pg_euckr_verifychar() already rejects 0x8E and 0x8F as lead
bytes, validated EUC-KR data never contains SS2/SS3 sequences. This patch
therefore introduces no behavior change for valid EUC-KR data.

This was discussed in [1].

[1]
https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com

--
SungJun Jang

Attachment Content-Type Size
image/png 72.4 KB
v1-0001-Make-EUC-KR-encoding-routines-self-contained.patch application/octet-stream 4.1 KB
