| From: | SungJun Jang <sjjang112233(at)gmail(dot)com> |
|---|---|
| To: | assam258(at)gmail(dot)com |
| Cc: | Junwang Zhao <zhjwpku(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, li(dot)evan(dot)chao(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com, michael(at)paquier(dot)xyz |
| Subject: | Re: Remove invalid SS2/SS3 handling from EUC-KR routines |
| Date: | 2026-05-13 05:50:28 |
| Message-ID: | CAE+cgNj3yqMWvumU4Wi_6Sh4PptwO_1xee5+fKGyF=RX352BQQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Henson,
Thanks for the thorough review and the wider audit.
> One small observation: after this patch, EUC-KR's mb routines become
> structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
> branch, maxmblen=2), which is a nice consistency win and arguably the
> right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
> the commit message.
Added to the commit message.
> I don't think any real client code relies on 3, but the release notes
> should mention it.
Added a release note entry for the pg_encoding_max_length('EUC_KR')
change (3 -> 2) in doc/src/sgml/release-19.sgml.
v2 attached. No other changes from v1.
Thanks again for the review
Regards,
SungJun Jang
2026년 5월 12일 (화) 오후 8:39, Henson Choi <assam258(at)gmail(dot)com>님이 작성:
> Hi SungJun,
>
> Thanks for the patch. I applied v1 on top of master; it builds
> cleanly and the regression tests pass here. I agree with the
> direction; a few comments inline.
>
> Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
>> (ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
>> single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
>> as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.
>>
>
> Right. I checked the existing pg_euckr_verifychar() in wchar.c and
> it indeed has no SS2/SS3 branch -- it accepts only ASCII and 0xA1-0xFE
> lead bytes via IS_EUC_RANGE_VALID(). So the verifier and the new
> mblen()/mb2wchar_with_len() now tell the same story for valid input,
> which is the goal.
>
> I also did a wider audit for EUC-KR-specific SS2/SS3 handling outside
> wchar.c and found none: the UTF-8 <-> EUC-KR conversion proc is
> clean, pg_eucjp_increment in mbutils.c is EUC-JP only, and EUC-KR
> falls back to pg_generic_charinc which delegates to
> pg_euckr_verifychar (already SS2/SS3-free). So the three functions
> this patch rewrites are the only entry points; no dangling SS2/SS3
> path remains for EUC-KR after the patch.
>
> - Set maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
>>
>
> This is the only user-visible change, via pg_encoding_max_length('EUC_KR')
> (see mbutils.c). The value drops from 3 to 2. I don't think any
> real client code relies on 3, but the release notes should mention it.
>
> One small observation: after this patch, EUC-KR's mb routines become
> structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
> branch, maxmblen=2), which is a nice consistency win and arguably the
> right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
> the commit message.
>
> +1 from me.
>
>
> ---- Side note on EUC-CN ----
>
> GB 2312 under EUC-CN appears to be in the same standards situation
> -- the existing pg_euccn_mblen comment in wchar.c states "CS2 and
> CS3 are not defined for EUC_CN", and the af79c30dc3e commit message
> similarly says "EUC_CN supports only 1- and 2-byte sequences (CS0,
> CS1)" -- yet pg_euccn_* still carries SS2/SS3 branches and keeps
> maxmblen=3. As I read it, that shape was a
> deliberate choice in commit af79c30dc3e ("Fix encoding length for
> EUC_CN", CVE-2026-2006) -- the minimal back-patchable fix -- and
> the commit message seems to leave the door open for master to
> "harmonize in a different direction", though I may be reading more
> into it than was intended.
>
> An analogous self-contained cleanup of EUC-CN looks like a natural
> follow-up. Historically the EUC code in wchar.c was shaped by
> Japanese and Western contributors -- which is why the shared
> pg_euc_* helpers carry JIS X 0201/0212/0208 assumptions -- and
> EUC-CN inherited that shape by delegation. With the Chinese
> contributor community now well established in the project, an
> EUC-CN cleanup feels like a natural fit for contributors closer to
> that ecosystem, who can also supply native test data, in the same
> way KS X 2901 grounds this patch on the Korean side. Noting it
> here so the idea stays on the archive; no action requested in this
> thread.
>
> Regards,
> Henson Choi
>
>>
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Make-EUC-KR-encoding-routines-self-contained.patch | application/octet-stream | 5.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chengpeng Yan | 2026-05-13 06:06:16 | Re: Add a greedy join search algorithm to handle large join problems |
| Previous Message | Zhenwei Shang | 2026-05-13 05:49:31 | Re: [PATCH] psql: Add missing IO option to EXPLAIN tab completion |