Re: Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges

From: Henson Choi <assam258(at)gmail(dot)com>
To: ehrjs023(at)gmail(dot)com
Cc: ishii(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges
Date: 2026-07-01 07:49:11
Message-ID: CAAAe_zCQFoLw8p72Dqr2f2r071npX++WOByFqXFSi=u5aj2HPw@mail.gmail.com
Lists: pgsql-hackers

Hi DoGeon,

Thanks for picking this up, and nice first patch. I reviewed v1 (both
patches) and the change is correct. Below: an independent confirmation,
a sourcing note, and two optional test additions.

> the accept/reject outcome for any input is unchanged;

I checked this exhaustively rather than by spot check. Scanning the
full two-byte space against PostgreSQL's own uhc_to_utf8.map (17,237
mapped sequences), the tightened accept set (lead 0x81-0xFE; trail
0x41-0x5A, 0x61-0x7A, 0x81-0xFE) is a strict superset of the mapped
set -- zero real mappings fall in the newly rejected ranges. So
nothing that decodes today stops decoding; only the eight structurally
invalid pairs move to the correct error.
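
For illustration, a minimal sketch of that kind of scan in C. The names
uhc_pair_ok() and count_rejected() are hypothetical, not functions from
the patch or from PostgreSQL, and the caller is assumed to supply the
two-byte codes extracted from the map as 16-bit values (lead byte in
the high half).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tightened rule as quoted in this thread: lead 0x81-0xFE;
 * trail 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE. Hypothetical helper. */
bool
uhc_pair_ok(uint8_t lead, uint8_t trail)
{
    if (lead < 0x81 || lead > 0xFE)
        return false;
    return (trail >= 0x41 && trail <= 0x5A) ||
           (trail >= 0x61 && trail <= 0x7A) ||
           (trail >= 0x81 && trail <= 0xFE);
}

/* Count mapped two-byte codes that the tightened rule would reject.
 * A result of 0 means no existing mapping stops decoding. */
size_t
count_rejected(const uint16_t *codes, size_t n)
{
    size_t rejected = 0;

    for (size_t i = 0; i < n; i++)
    {
        uint8_t lead = (uint8_t) (codes[i] >> 8);
        uint8_t trail = (uint8_t) (codes[i] & 0xFF);

        if (!uhc_pair_ok(lead, trail))
            rejected++;
    }
    return rejected;
}
```

Fed every code from the map, a return value of 0 would correspond to
the "zero real mappings" result described above.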

> - Microsoft CP949 (Windows-949) specifies the two-byte form as
> lead 0x81-0xFE, trail 0x41-0x5A | 0x61-0x7A | 0x81-0xFE.

Right -- and even the WHATWG Encoding Standard's euc-kr (= CP949) decoder
takes a wider trail, 0x41-0x7E and 0x81-0xFE. Side by side:

Rule                Lead       Trail
------------------  ---------  -------------------------------
Old verifier        0x80-0xFF  any byte but 0x00
WHATWG (CP949)      0x81-0xFE  0x41-0x7E, 0x81-0xFE
CP949 / this patch  0x81-0xFE  0x41-0x5A, 0x61-0x7A, 0x81-0xFE

Your rule matches the actual CP949 assignment and is even tighter than
the WHATWG structural envelope, rejecting the gaps at verify time.
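
To make the nesting concrete, here is a small C sketch that encodes the
three rows as predicates and counts the two-byte pairs each accepts.
The ranges are transcribed from the comparison in this message, not
from the standards or the PostgreSQL source, and all function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*pair_rule) (uint8_t lead, uint8_t trail);

/* "Old verifier" row: lead 0x80-0xFF, any trail but 0x00. */
bool
old_ok(uint8_t lead, uint8_t trail)
{
    return lead >= 0x80 && trail != 0x00;
}

/* "WHATWG (CP949)" row: lead 0x81-0xFE; trail 0x41-0x7E or 0x81-0xFE. */
bool
whatwg_ok(uint8_t lead, uint8_t trail)
{
    return lead >= 0x81 && lead <= 0xFE &&
        ((trail >= 0x41 && trail <= 0x7E) ||
         (trail >= 0x81 && trail <= 0xFE));
}

/* "CP949 / this patch" row: the letter ranges plus 0x81-0xFE. */
bool
patch_ok(uint8_t lead, uint8_t trail)
{
    return lead >= 0x81 && lead <= 0xFE &&
        ((trail >= 0x41 && trail <= 0x5A) ||
         (trail >= 0x61 && trail <= 0x7A) ||
         (trail >= 0x81 && trail <= 0xFE));
}

/* Count how many of the 65,536 two-byte pairs a rule accepts. */
unsigned
count_pairs(pair_rule rule)
{
    unsigned n = 0;

    for (int lead = 0; lead <= 0xFF; lead++)
        for (int trail = 0; trail <= 0xFF; trail++)
            if (rule((uint8_t) lead, (uint8_t) trail))
                n++;
    return n;
}
```

Under these definitions count_pairs() gives 32640 for old_ok, 23688 for
whatwg_ok, and 22428 for patch_ok, and every pair that patch_ok accepts
is also accepted by the other two.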

Two optional test cases would close the last coverage gaps in uhc.sql
(neither blocks commit):

-- accept: upper lead boundary 0xFE. Today 0xFE appears only as a
-- trail byte, so the `c1 > 0xfe` bound is never exercised.
SELECT encode(convert_to(convert_from('\xfea1', 'UHC'), 'UTF8'), 'hex');
-- -> ee819e

-- reject: trail 0x00, the sole trail the old verifier also rejected.
SELECT convert_from('\x8100', 'UHC'); -- 0x00
-- -> ERROR: invalid byte sequence for encoding "UHC": 0x81 0x00

Both pass with the patch applied. With those folded in, this looks ready
to me.

Thanks again,
Henson
