| From: | 유도건 <ehrjs023(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org, Henson Choi <assam258(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, thomas(dot)munro(at)gmail(dot)com |
| Subject: | Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges |
| Date: | 2026-06-05 02:20:26 |
| Message-ID: | CAFVBZ_GuA1SrRDqUNnCPzbCZGFvzC18+-0YQEKpAnJesut1xew@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
Per CP949 (Windows-949), a two-byte UHC sequence requires the lead
byte to be in 0x81-0xFE and the trail byte to be in 0x41-0x5A,
0x61-0x7A, or 0x81-0xFE.
pg_uhc_verifychar() in src/common/wchar.c accepts any lead byte
with the high bit set (0x80-0xFF) and any trail byte other than
NUL, without enforcing those ranges. Out-of-range pairs such as
0x80 0x41 (invalid lead) or 0x81 0x40 (invalid trail) are accepted
by the verifier and rejected only later by the conversion table,
with the message:
    ERROR:  character with byte sequence 0x80 0x41 in encoding "UHC"
            has no equivalent in encoding "UTF8"
This is misleading: those pairs are not unmappable, they are
structurally invalid in CP949. It is also inconsistent with
pg_euckr_verifychar() (src/common/wchar.c:1044), which already
enforces lead/trail byte ranges explicitly via IS_EUC_RANGE_VALID().
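
To make the problem concrete, here is a simplified model of the lax
behavior described above. It is a sketch written from that description,
not a copy of the code in src/common/wchar.c; the function name
lax_uhc_verify() is hypothetical, and the return convention (sequence
length, or -1 if invalid) is assumed for illustration.

```c
#include <assert.h>

/*
 * Simplified model of the lax check: any byte with the high bit set
 * starts a two-byte sequence, and the only rejected trail byte is NUL
 * (plus the 0x8d 0x20 sentinel pair). Hypothetical helper, not the
 * PostgreSQL source.
 */
static int
lax_uhc_verify(const unsigned char *s, int len)
{
    int mbl;

    assert(len >= 1);               /* caller must pass at least one byte */

    mbl = (s[0] & 0x80) ? 2 : 1;    /* lead range is not checked */

    if (len < mbl)
        return -1;                  /* truncated sequence */
    if (mbl == 2 && s[0] == 0x8d && s[1] == 0x20)
        return -1;                  /* sentinel pair */
    if (mbl == 2 && s[1] == '\0')
        return -1;                  /* NUL trail byte */

    return mbl;
}
```

Under this model both 0x80 0x41 and 0x81 0x40 return 2, i.e. they pass
verification and only fail later, during conversion.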
The following evidence supports tightening the UHC verifier:
- Microsoft CP949 (Windows-949) specifies the two-byte form as
lead 0x81-0xFE, trail 0x41-0x5A | 0x61-0x7A | 0x81-0xFE.
Other byte values are not valid for the two-byte form.
- PostgreSQL's own UHC -> UTF-8 conversion table is already built
on this assumption. The radix tree header in
src/backend/utils/mb/Unicode/uhc_to_utf8.map declares:
    0x81, /* b2_1_lower */
    0xfe, /* b2_1_upper */
    0x41, /* b2_2_lower */
    0xfe, /* b2_2_upper */
i.e. the conversion side already restricts the byte ranges and
rejects anything outside them; the rejection simply happens in the
wrong place, with the wrong message, because the verifier lets
these pairs through.
- pg_euckr_verifychar() already follows the strict shape: it
validates lead/trail ranges directly rather than relying on
pg_uhc_mblen() + a NUL-only trail check. This patch brings
pg_uhc_verifychar() in line with it.
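
As a rough illustration of the second point, the four bounds quoted
from the map header can be read as a bounding box on two-byte input.
The sketch below only encodes those four numbers; the macro and
function names are hypothetical, and the real radix-tree lookup in the
conversion code does more than this bounds test.

```c
#include <stdbool.h>

/* Two-byte bounds as quoted from the uhc_to_utf8.map header. */
#define B2_1_LOWER 0x81
#define B2_1_UPPER 0xfe
#define B2_2_LOWER 0x41
#define B2_2_UPPER 0xfe

/*
 * Hypothetical helper: does the pair (b1, b2) fall inside the table's
 * two-byte bounding box? Pairs outside it cannot have a mapping.
 */
static bool
uhc_pair_in_table_bounds(unsigned char b1, unsigned char b2)
{
    return b1 >= B2_1_LOWER && b1 <= B2_1_UPPER &&
           b2 >= B2_2_LOWER && b2 <= B2_2_UPPER;
}
```

Both 0x80 0x41 and 0x81 0x40 fall outside this box. Note that the box
is coarser than the CP949 trail ranges: a pair such as 0x81 0x5B is
inside it, even though 0x5B is not a valid trail byte.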
This is split into two patches to make the change visible:
0001 -- Add a regression test for UHC.
UHC is a client-only encoding, so there has been no dedicated
test for pg_uhc_verifychar(). This adds
src/test/regress/sql/uhc.sql, exercising the verifier through
convert_from() in a UTF8 database. The expected output records
the *current* behavior on master, so this patch applies cleanly
and all tests pass without any code change.
0002 -- Tighten pg_uhc_verifychar() to enforce CP949 byte ranges.
Rewrite pg_uhc_verifychar() to check lead range (0x81-0xFE) and
trail range (0x41-0x5A, 0x61-0x7A, or 0x81-0xFE) directly,
following the style of pg_euckr_verifychar(). The new
trail-range check also subsumes the previous NONUTF8_INVALID
sentinel check (0x8d 0x20), which is removed -- 0x20 is not in
any valid trail range, so 0x8d 0x20 is still rejected.
The diff in expected/uhc.out touches exactly eight error messages,
each changing as follows:
    -ERROR:  character with byte sequence 0xXX 0xYY in encoding
    -        "UHC" has no equivalent in encoding "UTF8"
    +ERROR:  invalid byte sequence for encoding "UHC": 0xXX 0xYY
No other test result changes. This makes the user-visible
effect of the fix self-evident:
- the accept/reject outcome for any input is unchanged;
- the error message format changes from "has no equivalent in
encoding UTF8" to "invalid byte sequence for encoding UHC"
for the eight previously misclassified pairs;
- rejection moves from the conversion step to the verifier,
which is the appropriate place for a structural check.
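
For readers who want to see the shape of the check described for 0002,
here is a minimal standalone sketch. It is not the attached patch: the
function names are hypothetical, and the return convention (sequence
length, or -1 if invalid) is assumed to mirror the style of
pg_euckr_verifychar().

```c
#include <stdbool.h>

/* CP949 lead byte: 0x81-0xFE. */
static bool
uhc_is_lead(unsigned char c)
{
    return c >= 0x81 && c <= 0xfe;
}

/* CP949 trail byte: 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE. */
static bool
uhc_is_trail(unsigned char c)
{
    return (c >= 0x41 && c <= 0x5a) ||
           (c >= 0x61 && c <= 0x7a) ||
           (c >= 0x81 && c <= 0xfe);
}

/*
 * Hypothetical strict verifier: returns 1 for a single-byte character,
 * 2 for a structurally valid two-byte sequence, and -1 otherwise.
 */
static int
strict_uhc_verify(const unsigned char *s, int len)
{
    if (len < 1)
        return -1;
    if (s[0] < 0x80)
        return 1;               /* single byte, no trail to check */
    if (!uhc_is_lead(s[0]))
        return -1;              /* 0x80 and 0xFF are not lead bytes */
    if (len < 2)
        return -1;              /* truncated two-byte sequence */
    if (!uhc_is_trail(s[1]))
        return -1;              /* also covers NUL and 0x8d 0x20 */
    return 2;
}
```

With this shape, 0x80 0x41, 0x81 0x40, and 0x8d 0x20 all return -1 at
verification time, while a pair such as 0x81 0x41 still returns 2.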
Only client-side paths are affected since UHC is not supported as
a server encoding.
This issue was reported by Henson Choi in [1].
[1] https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com
v1 patches attached.
Regards,
DoGeon Yoo
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Add-regression-test-for-UHC-encoding-baseline-capture.patch | application/octet-stream | 8.2 KB |
| v1-0002-Tighten-pg_uhc_verifychar-to-enforce-CP949-lead-trail.patch | application/octet-stream | 5.4 KB |