| From: | SungJun Jang <sjjang112233(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org, assam258(at)gmail(dot)com, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com |
| Subject: | Remove invalid SS2/SS3 handling from EUC-KR routines |
| Date: | 2026-05-12 06:09:49 |
| Message-ID: | CAE+cgNgTWvCT2+HZYRzQA8-wSQrj-FjPQQNffn=_3DOpz0pKgA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
(ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.
PostgreSQL currently has two inconsistencies with this:
1. Table 23.3 in the documentation lists EUC_KR Bytes/Char as "1-3".
2. pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len()
delegate to the shared pg_euc_* helpers, which include SS2 (0x8E)
and SS3 (0x8F) handling written for encodings that designate G2/G3
(e.g. EUC-JP, EUC-TW).
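To illustrate the mismatch in item 2, here is a minimal sketch of a generic EUC length routine of the kind the shared helpers implement. The function name generic_euc_mblen and the macro definitions are hypothetical and chosen for illustration; this is not a copy of the PostgreSQL source.

```c
#define SS2 0x8e    /* single shift 2: selects G2 */
#define SS3 0x8f    /* single shift 3: selects G3 */

/*
 * Hypothetical generic EUC length routine (illustration only).
 * Returns the byte length of the character starting at s.
 */
static int
generic_euc_mblen(const unsigned char *s)
{
    if (*s == SS2)
        return 2;       /* SS2 followed by one byte */
    if (*s == SS3)
        return 3;       /* SS3 followed by two bytes */
    if (*s & 0x80)
        return 2;       /* G1: two-byte character */
    return 1;           /* G0: ASCII */
}
```

With a routine shaped like this, a buffer starting with 0x8F is reported as a 3-byte character, a length that no valid EUC-KR sequence has.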
The following evidence confirms that SS2/SS3 are not part of EUC-KR:
- KS X 2901 defines EUC-KR with the following code set table
(see attached ksx2901-euc-kr-code-set-table.png):
    Code set  Code value representation        Character set
    0         0XXXXXXX                         KS X 1003 (ASCII)
    1         1XXXXXXX 1XXXXXXX                KS X 1001
    2         SS2 1XXXXXXX [1XXXXXXX [...]]    undefined
    3         SS3 1XXXXXXX [1XXXXXXX [...]]    undefined
The standard states: "In particular, since the character sets for
code set 2 and code set 3 are not defined, they may be defined and
used in the future if necessary."
- pg_euckr_verifychar() (src/common/wchar.c:1044) already has no SS2/SS3
branch; it accepts only 0x00-0x7F (G0, ASCII) and 0xA1-0xFE lead bytes
(G1, KS X 1001). Any 0x8E or 0x8F byte is rejected.
This patch fixes both inconsistencies:
- Replace the three delegating functions with EUC-KR-specific
implementations that recognise only G0 (1 byte) and G1 (2 bytes).
- Reduce maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
- Correct Table 23.3 from "1-3" to "1-2".
pg_euckr_verifychar() already has no SS2/SS3 branch, so SS2/SS3 bytes
are never admitted as valid lead bytes. This patch therefore introduces
no behavior change for valid EUC-KR data.
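As a rough sketch of what EUC-KR-specific routines could look like, the following recognises only G0 (1 byte) and G1 (2 bytes). The names euckr_mblen_sketch and euckr_verify_sketch are hypothetical and are not taken from the patch; the accepted byte ranges follow the description of pg_euckr_verifychar() above.

```c
/*
 * Hypothetical EUC-KR-only routines (illustration only; names and
 * structure are assumptions, not the patch's code).
 */

/* Byte length: ASCII is 1 byte, any byte with the high bit set starts 2. */
static int
euckr_mblen_sketch(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Validity check: returns the character length, or -1 if the bytes at s
 * do not form a valid EUC-KR character within len bytes.
 */
static int
euckr_verify_sketch(const unsigned char *s, int len)
{
    if (len < 1)
        return -1;
    if (!(s[0] & 0x80))
        return 1;                       /* G0: 0x00-0x7F */
    if (len < 2)
        return -1;                      /* truncated 2-byte character */
    if (s[0] < 0xa1 || s[0] > 0xfe)
        return -1;                      /* rejects 0x8E and 0x8F as lead bytes */
    if (s[1] < 0xa1 || s[1] > 0xfe)
        return -1;                      /* trail byte out of range */
    return 2;                           /* G1: KS X 1001 */
}
```

In this sketch the length routine never returns 3, and a lead byte of 0x8E or 0x8F is treated like any other high-bit byte by the length routine while still being rejected by the validity check.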
This was discussed in [1].
[1] https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com
--
SungJun Jang
| Attachment | Content-Type | Size |
|---|---|---|
| ksx2901-euc-kr-code-set-table.png | image/png | 72.4 KB |
| v1-0001-Make-EUC-KR-encoding-routines-self-contained.patch | application/octet-stream | 4.1 KB |