| From: | SungJun Jang <sjjang112233(at)gmail(dot)com> |
|---|---|
| To: | assam258(at)gmail(dot)com |
| Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Junwang Zhao <zhjwpku(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, li(dot)evan(dot)chao(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com |
| Subject: | Re: Remove invalid SS2/SS3 handling from EUC-KR routines |
| Date: | 2026-05-14 04:23:55 |
| Message-ID: | CAE+cgNj-0+EncNiby=-gaH8OQQ3cKFBdMh9zCuOD8HsmtfHFNw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Henson,
Thanks for the thorough review notes and the detailed coverage list.
> I plan to work through them myself gradually, as time
> allows, and share what I find along the way; other eyes would
> be welcome.
Agreed. I will follow the same approach -- work through the list
gradually and share findings as I go.
> On the release-19.sgml hunk: release
> notes are curated by the release manager from commit messages
> after the fact, not written into the patch. It would probably
> be cleaner for v3 to omit that hunk and let the commit message
> carry the user-visible change instead.
Thanks for the clarification.. v3 attached: release-19.sgml hunk
removed. The commit message now carries the user-visible change:
"The only user-visible effect is that pg_encoding_max_length('EUC_KR')
now returns 2 instead of 3."
The charset.sgml Table 23.3 fix is retained.
Regards,
SungJun Jang
2026년 5월 14일 (목) 오전 10:56, Henson Choi <assam258(at)gmail(dot)com>님이 작성:
> Hi Michael, SungJun,
>
> One framing point first.
>
> Correctness of the rewritten routines is the easy part -- for valid
> EUC-KR data there is no behavior change. What deserves scrutiny is
> the side-effect surface: callers that have been getting mblen=2 or 3
> for 0x8E/0x8F, and the fact that such bytes can already be on disk
> via routes that skip pg_verify_mbstr. After the patch those rows
> are measured differently by length/substring/LIKE.
>
> Suggested review principles:
>
> (P1) Indirect-entry completeness -- pg_wchar_table[] should be
> confirmed as the only callback registry into the EUC-KR slots.
> (P2) Direct-call review -- callers of pg_mblen / pg_dsplen /
> pg_encoding_mblen / pg_verify_mbstr / pg_mb*cliplen /
> pg_encoding_max_length and their cstring variants should be
> enumerated and each call site checked for any reliance on
> the old mblen=2/3 return for 0x8E/0x8F.
> (P3) Verify-bypass:
> (a) call-time -- code that runs mblen/dsplen on bytes not
> verified in the same call chain (copyto.c file_encoding
> scan, cstring pg_mblen in formatting.c / varlena.c) should
> be identified.
> (b) load-time -- byte-preserving carry-over that skips
> pg_verify_mbstr (pg_upgrade relfilenode preservation,
> physical replication / WAL replay, pre-patch on-disk data
> from earlier versions) should be reasoned about.
> (P4) Cross-encoding contamination -- it should be confirmed that no
> Japanese-shaped assumption (SS2/SS3, JIS X 0201/0212,
> pg_eucjp_increment, shared pg_euc_* helpers) can reach EUC-KR
> data after the split.
> (P5) User-visible width -- pg_encoding_max_length('EUC_KR') changes
> from 3 to 2; buffer-sizing / width-bounded consumers should be
> checked.
>
> Review-coverage list (not a test matrix):
>
> Encoding-layer surface (P5)
> - pg_wchar_table[PG_EUC_KR].maxmblen and readers of
> pg_encoding_max_length() that switch on EUC-KR width
> - pg_database_encoding_max_length() callers in mbutils.c
>
> Direct callers of pg_mblen / pg_dsplen (P2)
> - varlena.c: text_substring, text_position, text_left/right,
> text_reverse, length-family
> - oracle_compat.c: lpad/rpad/ltrim/rtrim/btrim/translate
> - formatting.c: to_char width handling (pg_mblen_cstr)
> - varchar.c: bpchar / varchar length enforcement
> (pg_mbcharcliplen)
> - xml.c, jsonfuncs.c: pg_mblen on textual content
>
> Pattern matching (P2)
> - like.c: multibyte LIKE
>
> Call-time verify-bypass (P3a)
> - copyto.c: file_encoding delimiter scan
> - formatting.c / varlena.c: cstring pg_mblen callers
>
> Load-time verify-bypass (P3b) -- highest follow-up priority
> - pg_upgrade: relfilenode preservation (bytes as-is)
> - Physical replication / WAL replay
> - Pre-patch on-disk data from earlier versions
>
> Pre-existing-data behavior (P3b consequence)
> - length/char_length/octet_length, substring/left/right/position,
> LIKE / regex on rows containing 0x8E/0x8F
> - Sort order / btree text indexes built before the patch
> (bytewise comparator, likely safe but worth naming)
>
> tsearch (P2)
> - spell.c, ts_locale.c, wparser_def.c, dict_thesaurus.c
>
> Conversion proc sanity
> - utf8_and_euc_kr (table-driven, expected unchanged)
>
> These are the items I think are worth a line or two of reasoning
> from someone who has read the code, though I don't think they
> need to block this patch -- treating the review-coverage list as
> background and following up separately is a perfectly fine
> outcome. I plan to work through them myself gradually, as time
> allows, and share what I find along the way; other eyes would
> be welcome.
>
> ----------------------------------------------------------------------
>
> On 2026-05-13, Michael Paquier wrote:
>
> > I unfortunately cannot read the PNG file [...]
>
> I read KS X 2901 clause 4.2 directly in Korean; SungJun's later
> transcription matches the normative text verbatim.
>
> (On src/test/locale/: agreed with SungJun -- a separate concern
> from this patch.)
>
> > - the proposed patch has zero regression tests.
>
> Agreed on the coverage gap. SungJun's proposed 0001 baseline
> tests (pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001
> handling, SS2 rejection) are a sensible starting point and worth
> adopting as-is. Beyond that, the review-coverage list above is
> what I plan to walk as follow-up: some items will naturally
> suggest further regression tests, while others (notably the
> load-time bypass paths in P3b) can only be reasoned about in
> review. Any additional tests can come out of that pass.
>
> ----------------------------------------------------------------------
>
> On 2026-05-13, SungJun Jang wrote (v2):
>
> > v2 attached. [commit message and release note updated]
>
> Commit-message update LGTM. On the release-19.sgml hunk: release
> notes are curated by the release manager from commit messages
> after the fact, not written into the patch. It would probably
> be cleaner for v3 to omit that hunk and let the commit message
> carry the user-visible change instead (pg_encoding_max_length
> 3 -> 2, plus the pre-existing 0x8E/0x8F behavior shift via the
> P3b routes). The charset.sgml Table 23.3 fix should stay --
> that is documented behavior, not a release note.
>
> ----------------------------------------------------------------------
>
> On 2026-05-13, SungJun Jang wrote (reply to Michael):
>
> > [KS X 2901 clause text quoted in Korean and English]
>
> Confirmed -- both match standard.go.kr.
>
> > I am planning to split into two patches:
> > 0001 -- Baseline capture [...]
> > 0002 -- Actual fix [...]
>
> The split is a nice shape for review -- characterization-first
> makes the behavior delta easy to see, and "exactly one line moves"
> is a useful audit signal. Whether to keep the split or squash at
> commit time is a committer call.
>
> Regards,
> Henson Choi
>
| Attachment | Content-Type | Size |
|---|---|---|
| v3-0001-Make-EUC-KR-encoding-routines-self-contained.patch | application/octet-stream | 4.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | jian he | 2026-05-14 04:36:56 | Re: [PATCH] Rebuild CHECK constraints after generated column SET EXPRESSION |
| Previous Message | Chao Li | 2026-05-14 04:14:54 | Should IGNORE NULLS cache nullness for volatile arguments? |