| From: | Zhongpu Chen <chenloveit(at)gmail(dot)com> |
|---|---|
| To: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> |
| Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 |
| Date: | 2026-05-02 04:49:00 |
| Message-ID: | CA+1gyqJwhQ5n4VZmJdnouaq7yMgYR+w_RiY=A6VWz4TzcUiHkw@mail.gmail.com |
| Lists: | pgsql-hackers |
Thanks for the clarification.
I agree that validating every input raises runtime-cost concerns, but the
cost can be kept well under control. For example, MySQL applies a
finer-grained check for EUC-CN (i.e., GB2312) in
https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:
```
static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}
```
where `code` is obtained by subtracting 0x8080 from the two-byte EUC-CN
value. Of course, MySQL's check could itself be made stricter.
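
To make the difference concrete, here is a minimal sketch in C that
contrasts a purely structural check with the range check above. The
function names are hypothetical, and the structural check is a simplified
stand-in rather than PostgreSQL's actual verifier. The range check only
mirrors the three intervals from the MySQL snippet; it does not consult the
mapping tables, so a code inside an interval may still be unassigned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical structural check: both bytes must lie in 0xA1..0xFE. */
static bool euc_cn_structurally_valid(uint8_t b1, uint8_t b2) {
  return b1 >= 0xA1 && b1 <= 0xFE && b2 >= 0xA1 && b2 <= 0xFE;
}

/* Combine the two bytes and subtract 0x8080, as described above. */
static int euc_cn_to_gb2312_code(uint8_t b1, uint8_t b2) {
  return ((b1 << 8) | b2) - 0x8080;
}

/* Hypothetical stricter check: structural validity plus the three code
 * intervals used in the MySQL snippet. Table lookups are omitted. */
static bool euc_cn_in_mapped_range(uint8_t b1, uint8_t b2) {
  if (!euc_cn_structurally_valid(b1, b2))
    return false;
  int code = euc_cn_to_gb2312_code(b1, b2);
  return (code >= 0x2121 && code <= 0x2658) ||
         (code >= 0x2721 && code <= 0x296F) ||
         (code >= 0x3021 && code <= 0x777E);
}
```

For example, the byte pair 0xAA 0xA1 passes the structural check but yields
code 0x2A21, which falls outside all three intervals, so the stricter check
rejects it. The extra cost is a few integer comparisons per two-byte
character.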
Anyway, it is reasonable to note these details in the documentation.
On Sat, May 2, 2026 at 11:28 AM David G. Johnston <
david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> On Friday, May 1, 2026, Zhongpu Chen <chenloveit(at)gmail(dot)com> wrote:
>
>> The issue is not specific to E'\\x..' literals. A normal COPY FROM data
>> file with ENCODING 'EUC_CN' can create text rows that later cannot be
>> retrieved with SELECT.
>>
>> This suggests that input validation for EUC_CN is only structural, while
>> the EUC_CN-to-UTF8 conversion table is stricter.
>>
>
> I suspect a lack of desire to maintain and ensure that specific values are
> verified; or accepting the runtime cost to do so. It is indeed
> structural. This point should probably be documented better. But it’s
> hard to feel too bad if the input claims it is providing verifiable EUC_CN
> data then proceeds to supply data that lacks meaning in reality. We are
> happy to just store and return your data to you - but it’s unreasonable to
> ask for it to be converted. It would be nice for the database to provide
> an extra layer of protection, so I’m not against the idea. Either
> automatically or at least providing a function that could, say, be
> called in a trigger for opt-in. But definitely feels like a problematic
> benefit-to-cost proposition.
>
> David J.
>
>
--
Zhongpu Chen