| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | ehrjs023(at)gmail(dot)com |
| Cc: | ishii(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges |
| Date: | 2026-07-01 07:49:11 |
| Message-ID: | CAAAe_zCQFoLw8p72Dqr2f2r071npX++WOByFqXFSi=u5aj2HPw@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi DoGeon,
Thanks for picking this up, and nice first patch. I reviewed v1 (both
patches) and the change is correct. Below are an independent
confirmation, a sourcing note, and two optional test additions.

> the accept/reject outcome for any input is unchanged;

I checked this exhaustively rather than by spot check. Scanning the
full two-byte space against PostgreSQL's own uhc_to_utf8.map (17,237
mapped sequences), the tightened accept set (lead 0x81-0xFE; trail
0x41-0x5A, 0x61-0x7A, 0x81-0xFE) is a strict superset of every mapped
sequence -- zero real mappings fall in the newly-rejected ranges. So
nothing that decodes today stops decoding; only the eight structurally
invalid pairs move to the correct error.
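
For anyone who wants to repeat the scan, here is a minimal C sketch of the two rules and the counting loop. It is not the patch's code: the function names are hypothetical, it models only the two-byte form, the "old" rule is the one summarized in this thread (lead 0x80-0xFF, any trail but 0x00), and the mapped codes are assumed to be supplied by the caller as 16-bit values already extracted from uhc_to_utf8.map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Old two-byte rule as summarized in this thread (an assumption of the sketch). */
static bool
uhc_pair_ok_old(uint8_t lead, uint8_t trail)
{
    return lead >= 0x80 && trail != 0x00;
}

/* Tightened rule: lead 0x81-0xFE; trail 0x41-0x5A, 0x61-0x7A, 0x81-0xFE. */
static bool
uhc_pair_ok_new(uint8_t lead, uint8_t trail)
{
    if (lead < 0x81 || lead > 0xFE)
        return false;
    return (trail >= 0x41 && trail <= 0x5A) ||
           (trail >= 0x61 && trail <= 0x7A) ||
           (trail >= 0x81 && trail <= 0xFE);
}

/* Count byte pairs that the old rule accepted and the new rule rejects. */
static int
count_newly_rejected(void)
{
    int n = 0;

    for (int lead = 0; lead <= 0xFF; lead++)
        for (int trail = 0; trail <= 0xFF; trail++)
            if (uhc_pair_ok_old((uint8_t) lead, (uint8_t) trail) &&
                !uhc_pair_ok_new((uint8_t) lead, (uint8_t) trail))
                n++;
    return n;
}

/*
 * Count mapped two-byte codes (lead in the high byte, trail in the low
 * byte) that the new rule would reject.  Zero means the accept set
 * covers every mapped sequence in the array.
 */
static size_t
count_mapped_rejected(const uint16_t *codes, size_t ncodes)
{
    size_t bad = 0;

    assert(codes != NULL || ncodes == 0);
    for (size_t i = 0; i < ncodes; i++)
        if (!uhc_pair_ok_new((uint8_t) (codes[i] >> 8),
                             (uint8_t) (codes[i] & 0xFF)))
            bad++;
    return bad;
}
```

Feeding count_mapped_rejected() the full list of mapped codes and getting 0 back is the superset check described above; count_newly_rejected() only sizes the structural difference between the two rules and says nothing about the map.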

> - Microsoft CP949 (Windows-949) specifies the two-byte form as
>   lead 0x81-0xFE, trail 0x41-0x5A | 0x61-0x7A | 0x81-0xFE.

Right -- and even the WHATWG Encoding Standard's euc-kr (= CP949) decoder
takes a wider trail, 0x41-0x7E and 0x81-0xFE. Side by side:

Rule                Lead       Trail
------------------  ---------  -------------------------------
Old verifier        0x80-0xFF  any byte but 0x00
WHATWG (CP949)      0x81-0xFE  0x41-0x7E, 0x81-0xFE
CP949 / this patch  0x81-0xFE  0x41-0x5A, 0x61-0x7A, 0x81-0xFE

Your rule matches the actual CP949 assignment and is even tighter than
the WHATWG structural envelope, rejecting the gaps at verify time.
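
To make the gaps concrete, the three trail rules from the comparison can be written as predicates and counted. This is only an illustrative sketch: the enum and function names are hypothetical, and the WHATWG row is encoded exactly as quoted in this message rather than re-derived from the standard.

```c
#include <assert.h>
#include <stdbool.h>

/* The three trail-byte rules compared in this message. */
typedef enum
{
    RULE_OLD,       /* any byte but 0x00 */
    RULE_WHATWG,    /* 0x41-0x7E, 0x81-0xFE (as quoted in this message) */
    RULE_PATCH      /* 0x41-0x5A, 0x61-0x7A, 0x81-0xFE */
} trail_rule;

/* Return true if byte b is an acceptable trail byte under the given rule. */
static bool
trail_ok(trail_rule rule, int b)
{
    assert(b >= 0 && b <= 0xFF);

    switch (rule)
    {
        case RULE_OLD:
            return b != 0x00;
        case RULE_WHATWG:
            return (b >= 0x41 && b <= 0x7E) ||
                   (b >= 0x81 && b <= 0xFE);
        case RULE_PATCH:
            return (b >= 0x41 && b <= 0x5A) ||
                   (b >= 0x61 && b <= 0x7A) ||
                   (b >= 0x81 && b <= 0xFE);
    }
    return false;
}

/* Count how many of the 256 byte values a rule accepts as a trail. */
static int
count_trails(trail_rule rule)
{
    int n = 0;

    for (int b = 0; b <= 0xFF; b++)
        if (trail_ok(rule, b))
            n++;
    return n;
}
```

Under these definitions the patch accepts 178 trail values against 188 for the WHATWG row; the ten bytes of difference are 0x5B-0x60 and 0x7B-0x7E.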
Two optional test cases would close the last coverage gaps in uhc.sql
(neither blocks commit):
-- accept: upper lead boundary 0xFE. Today 0xFE appears only as a
-- trail byte, so the `c1 > 0xfe` bound is never exercised.
SELECT encode(convert_to(convert_from('\xfea1', 'UHC'), 'UTF8'), 'hex');
-- -> ee819e
-- reject: trail 0x00, the sole trail the old verifier also rejected.
SELECT convert_from('\x8100', 'UHC'); -- 0x00
-- -> ERROR: invalid byte sequence for encoding "UHC": 0x81 0x00
Both pass with the patch applied. With those folded in, this looks ready
to me.
Thanks again,
Henson