Re: Remove invalid SS2/SS3 handling from EUC-KR routines

From: SungJun Jang <sjjang112233(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Henson Choi <assam258(at)gmail(dot)com>, Junwang Zhao <zhjwpku(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, li(dot)evan(dot)chao(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Remove invalid SS2/SS3 handling from EUC-KR routines
Date: 2026-05-13 13:52:36
Message-ID: CAE+cgNhAxQdZ6h_CYXhj77b-JQw_F1NWv6JZkxDV=Ac_3ceU7A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Michael,

Thanks for the review.

> I unfortunately cannot read the PNG file added at the top of the
> thread, and it is a bit hard to check what you are doing here.

Apologies for that. KS standards are published in Korean only; the
following material is provided to assist verification.

Standards referenced:
- KS X 2901: "UNIX — Hangeul environment"
- KS X 1001: "Code for Information Interchange (Hangeul and Hanja)"
- KS X 1003: "Code for Information Interchange (Latin)"

The relevant clause is at:

https://standard.go.kr/KSCI/api/std/viewMachine.do?reformNo=07&tmprKsNo=KSX2901&formType=STD

Navigation: left panel -> "4 확장 유닉스 부호계-한국 EUC-KR(Extended UNIX
Code-KoRea)" -> "4.2 부호계 1(문자 집합 1)"

The table at that clause reads:

Code set Code value representation Character set used
0 0XXXXXXX KS X 1003 (ASCII)
1 1XXXXXXX 1XXXXXXX KS X 1001
2 SS2 1XXXXXXX [1XXXXXXX [...]] Undefined
3 SS3 1XXXXXXX [1XXXXXXX [...]] Undefined

Accompanying normative text:
Korean: "유닉스 시스템에서 한글을 지원하기 위해서는 한글의 확장
유닉스 부호계-한국(EUC-KR) 표현 형식을 지원해 주어야 한다.
특히 부호계(문자 집합) 2, 3에 쓰는 부호계가 정의되어 있지
않으므로 앞으로 필요하면 부호계를 정의하여 쓸 수 있다."
English: "To support Korean in UNIX systems, the EUC-KR format must
be supported. In particular, since the character sets for
code set 2 and code set 3 are not defined, they may be
defined and used in the future if necessary."

The PNG is a screenshot of this exact clause; the text above is its
full content.

> - the existence of src/test/locale/, which may be impacted by what
> you are changing here. Perhaps we should try to modernize a bit
> this area, so as tests can become easier each time we try to
> change the conversion tables?

src/test/locale/ is a collation/ctype test infrastructure: it
validates sort order and libc character-class functions (isalpha and
friends) using locale-dependent data. This patch changes the
encoding layer (maxmblen, mblen, dsplen), which that infrastructure
does not exercise. Adding an EUC-KR directory there would not verify
the invariants this patch is locking down.

There is also a practical concern: src/test/locale/ uses a 1998-era
shell harness, is not integrated into check-world, and is not run by
the buildfarm. Tests added there would provide no ongoing coverage.

I agree that modernizing src/test/locale/ is worthwhile. I would
suggest doing that as a separate follow-up thread after this patch
merges, rather than blocking on it here.

> - the proposed patch has zero regression tests. This could be used
> to check the behavior before *and* after the patch.

Agreed. Before submitting a new version, I would like to confirm
the approach.

I am planning to split into two patches:

0001 -- Baseline capture. Adds four tests to
src/test/regress/sql/euc_kr.sql covering
pg_encoding_max_length, 1-byte (G0/ASCII) handling,
2-byte (G1/KS X 1001) handling, and SS2 (0x8E) lead-byte
rejection. The expected output records the current
(pre-fix) behavior, so pg_encoding_max_length shows 3.
This patch is not a bug fix -- it captures what master
does today. It applies to master as-is and all tests pass.

0002 -- Actual fix: wchar.c, charset.sgml, and release notes.
The expected output changes by exactly one line:

- 3
+ 2

All other test results remain unchanged, which is
evidence that valid EUC_KR data is unaffected.

The reason for putting the wrong value (3) in 0001's expected output
is so that 0002's diff speaks for itself: exactly one line changes,
nothing else moves. Without 0001, it would be unclear why the wrong
value appeared in the expected output in the first place.

Does this structure work for you, or would you prefer a different
approach?

Regards,
SungJun Jan

2026년 5월 13일 (수) 오후 2:14, Michael Paquier <michael(at)paquier(dot)xyz>님이 작성:

> On Tue, May 12, 2026 at 08:39:04PM +0900, Henson Choi wrote:
> > One small observation: after this patch, EUC-KR's mb routines become
> > structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
> > branch, maxmblen=2), which is a nice consistency win and arguably the
> > right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
> > the commit message.
>
> I unfortunately cannot read the PNG file added at the top of the
> thread, and it is a bit hard to check what you are doing here.
>
> I can however point out two things:
> - the existence of src/test/locale/, which may be impacted by what you
> are changing here. Perhaps we should try to modernize a bit this
> area, so as tests can become easier each time we try to change the
> conversion tables?
> - the proposed patch has zero regression tests. This could be used to
> check the behavior before *and* after the patch.
> --
> Michael
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zsolt Parragi 2026-05-13 14:03:28 Re: Try a presorted outer path when referenced by an ORDER BY prefix
Previous Message Alena Rybakina 2026-05-13 13:51:00 Re: pull-up subquery if JOIN-ON contains refs to upper-query