Re: Remove invalid SS2/SS3 handling from EUC-KR routines

From: Henson Choi <assam258(at)gmail(dot)com>
To: SungJun Jang <sjjang112233(at)gmail(dot)com>, Junwang Zhao <zhjwpku(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, li(dot)evan(dot)chao(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Remove invalid SS2/SS3 handling from EUC-KR routines
Date: 2026-05-12 11:39:04
Message-ID: CAAAe_zBQ2Tg+fdnTuOPLvDjuEB1OBW5D6sUr4Q-XhzT7g44XWQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi SungJun,

Thanks for the patch. I applied v1 on top of master; it builds
cleanly and the regression tests pass here. I agree with the
direction; a few comments inline.

Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
> (ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
> single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
> as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.
>

Right. I checked the existing pg_euckr_verifychar() in wchar.c and
it indeed has no SS2/SS3 branch -- it accepts only ASCII and 0xA1-0xFE
lead bytes via IS_EUC_RANGE_VALID(). So the verifier and the new
mblen()/mb2wchar_with_len() now tell the same story for valid input,
which is the goal.

I also did a wider audit for EUC-KR-specific SS2/SS3 handling outside
wchar.c and found none: the UTF-8 <-> EUC-KR conversion proc is
clean, pg_eucjp_increment in mbutils.c is EUC-JP only, and EUC-KR
falls back to pg_generic_charinc which delegates to
pg_euckr_verifychar (already SS2/SS3-free). So the three functions
this patch rewrites are the only entry points; no dangling SS2/SS3
path remains for EUC-KR after the patch.

- Set maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
>

This is the only user-visible change, via pg_encoding_max_length('EUC_KR')
(see mbutils.c). The value drops from 3 to 2. I don't think any
real client code relies on 3, but the release notes should mention it.

One small observation: after this patch, EUC-KR's mb routines become
structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
branch, maxmblen=2), which is a nice consistency win and arguably the
right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
the commit message.

+1 from me.

---- Side note on EUC-CN ----

GB 2312 under EUC-CN appears to be in the same standards situation
-- the existing pg_euccn_mblen comment in wchar.c states "CS2 and
CS3 are not defined for EUC_CN", and the af79c30dc3e commit message
similarly says "EUC_CN supports only 1- and 2-byte sequences (CS0,
CS1)" -- yet pg_euccn_* still carries SS2/SS3 branches and keeps
maxmblen=3. As I read it, that shape was a
deliberate choice in commit af79c30dc3e ("Fix encoding length for
EUC_CN", CVE-2026-2006) -- the minimal back-patchable fix -- and
the commit message seems to leave the door open for master to
"harmonize in a different direction", though I may be reading more
into it than was intended.

An analogous self-contained cleanup of EUC-CN looks like a natural
follow-up. Historically the EUC code in wchar.c was shaped by
Japanese and Western contributors -- which is why the shared
pg_euc_* helpers carry JIS X 0201/0212/0208 assumptions -- and
EUC-CN inherited that shape by delegation. With the Chinese
contributor community now well established in the project, an
EUC-CN cleanup feels like a natural fit for contributors closer to
that ecosystem, who can also supply native test data, in the same
way KS X 2901 grounds this patch on the Korean side. Noting it
here so the idea stays on the archive; no action requested in this
thread.

Regards,
Henson Choi

>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2026-05-12 11:47:23 Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Previous Message Etsuro Fujita 2026-05-12 11:36:25 Re: Import Statistics in postgres_fdw before resorting to sampling.