| From: | Zhongpu Chen <chenloveit(at)gmail(dot)com> |
|---|---|
| To: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> |
| Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 |
| Date: | 2026-05-06 06:34:46 |
| Message-ID: | CA+1gyqLj1squrZMTu2WkjoqVBeyPx=dozfaOa5XydoBjvLp02w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
I run a benchmark to test the performance over a Chinese classic novel with
respect to various validation strategies:
https://github.com/SWUFE-DB-Group/NUAV/blob/main/encoding-validation/NUAV/src/gb2312.rs
The running log of `cargo bench -- gb2312`:
```
Running benches/gb2312.rs (target/release/deps/gb2312-53d8e01b8e6785c8)
gb2312::is_gb2312_iconv time: [2.5621 ms 2.5681 ms 2.5740 ms]
change: [-2.6404% -2.3284% -2.0023%] (p = 0.00 <
0.05)
Performance has improved.
gb2312::is_gb2312_icu time: [3.2427 ms 3.2589 ms 3.2771 ms]
change: [-1.5710% -1.0409% -0.4387%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
gb2312::is_gb2312_rs time: [2.8157 ms 2.8229 ms 2.8303 ms]
change: [-1.6985% -1.2165% -0.7501%] (p = 0.00 <
0.05)
Change within noise threshold.
Benchmarking gb2312::is_gb2312_range: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 8.3s, enable flat sampling, or reduce sample count to 50.
gb2312::is_gb2312_range time: [1.6237 ms 1.6294 ms 1.6351 ms]
change: [+3.8720% +4.2901% +4.6933%] (p = 0.00 <
0.05)
Performance has regressed.
gb2312::is_gb2312_lookup
time: [488.12 µs 490.04 µs 491.97 µs]
change: [+0.9273% +2.2343% +3.2599%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild
gb2312::is_gb2312_simd time: [181.00 µs 181.77 µs 182.53 µs]
change: [-4.4563% -3.6971% -3.0260%] (p = 0.00 <
0.05)
Performance has improved.
gb2312:is_gb2312_ranges_pg
time: [467.69 µs 469.27 µs 470.82 µs]
Benchmarking gb2312:is_gb2312_ranges_mysql: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 6.4s, enable flat sampling, or reduce sample count to 60.
gb2312:is_gb2312_ranges_mysql
time: [1.2611 ms 1.2667 ms 1.2724 ms]
```
As we can see, the PG-style validation does not bring much improvement.
Instead, it is slower than my strict-styles.
On Sat, May 2, 2026 at 12:49 PM Zhongpu Chen <chenloveit(at)gmail(dot)com> wrote:
> Thanks for the clarification.
>
>
> I agree that validation on every input may have runtime-cost concerns. But
> this can be well-controlled. For example, MySQL adopts a finer checking for
> EUC-CN (i.e., GB2312) in
> https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:
>
>
> ```
>
> static int func_gb2312_uni_onechar(int code) {
> if ((code >= 0x2121) && (code <= 0x2658))
> return (tab_gb2312_uni0[code - 0x2121]);
> if ((code >= 0x2721) && (code <= 0x296F))
> return (tab_gb2312_uni1[code - 0x2721]);
> if ((code >= 0x3021) && (code <= 0x777E))
> return (tab_gb2312_uni2[code - 0x3021]);
> return (0);
> }
>
> ```
>
> where `code` is obtained by subtracting 0x8080. Of course, MySQL's
> checking can also be enhanced.
>
>
> Anyway, it is reasonable to note these details in the documentation.
>
>
> On Sat, May 2, 2026 at 11:28 AM David G. Johnston <
> david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>
>> On Friday, May 1, 2026, Zhongpu Chen <chenloveit(at)gmail(dot)com> wrote:
>>
>>> The issue is not specific to E'\\x..' literals. A normal COPY FROM data
>>> file with ENCODING 'EUC_CN' can create text rows that later cannot be
>>> retrieved with SELECT.
>>>
>>> This suggests that input validation for EUC_CN is only structural,
>>> while the EUC_CN-to-UTF8 conversion table is stricter.
>>>
>>
>> I suspect a lack of desire to maintain and ensure that specific values
>> are verified; or accepting the runtime cost to do so. It is indeed
>> structural. This point should probably be documented better. But it’s
>> hard to feel too bad if the input claims it is providing verifiable EUC_CN
>> data then proceeds to supply data that lacks meaning in reality. We are
>> happy to just store and return your data to you - but it’s unreasonable to
>> ask for it to be converted. It would be nice for the database to provide
>> an extra layer of protection, so I’m not against the idea. Either
>> automatically or or at least providing a function that could, say, be
>> called in a trigger for opt-in. But definitely feels like a problematic
>> benefit-to-cost proposition.
>>
>> David J.
>>
>>
>
> --
> Zhongpu Chen
>
--
Zhongpu Chen
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chao Li | 2026-05-06 06:40:24 | COPY JSON: use trailing commas in FORCE_ARRAY output |
| Previous Message | Jim Jones | 2026-05-06 06:33:29 | Re: VACUUM FULL, CLUSTER, and REPACK block on other sessions' temp tables |