Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From: Zhongpu Chen <chenloveit(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: peter(at)eisentraut(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: 2026-05-10 02:28:57
Message-ID: CA+1gyq+KeNhn=ZR6MZap49e8NX984O2z2FFoY_2dpmnMFL7a9w@mail.gmail.com
Lists: pgsql-hackers

My prototype implementation is at
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation
and usage notes are in
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation/blob/main/DEV.md
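
For anyone who wants to see the distinction discussed in this thread
without building the patch, here is a rough Python sketch. It is not
code from the patch or from PostgreSQL. It assumes a simplified
structural rule for EUC_CN (both bytes of a two-byte sequence in
0xA1..0xFE), and it uses Python's gb2312 codec as a stand-in for the
EUC_CN to UTF8 conversion map, which may not match PostgreSQL's map
byte for byte. The helper names, and the reading of the two modes
('native' as structure only, 'read_compatible' as structure plus
convertibility), are my own illustration.

```python
# Illustrative sketch only; not taken from the patch or from PostgreSQL.

def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Assumed structural rule: ASCII bytes stand alone; a byte with the
    high bit set must start a two-byte sequence in which both bytes are
    in the range 0xA1..0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):
            return False  # truncated two-byte sequence
        if not (0xA1 <= data[i] <= 0xFE and 0xA1 <= data[i + 1] <= 0xFE):
            return False
        i += 2
    return True


def decodes_as(data: bytes, codec: str) -> bool:
    """True if Python's codec can decode the bytes to Unicode."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


def accepts(data: bytes, mode: str) -> bool:
    """Hypothetical model of the two modes (an assumption, see above)."""
    if mode == "native":
        return is_structurally_valid_euc_cn(data)
    if mode == "read_compatible":
        # Python's gb2312 codec stands in for the EUC_CN -> UTF8 map.
        return is_structurally_valid_euc_cn(data) and decodes_as(data, "gb2312")
    raise ValueError(f"unknown mode: {mode}")


sample = bytes([0xA2, 0xA3])  # the byte sequence discussed in this thread
for mode in ("native", "read_compatible"):
    print(mode, accepts(sample, mode))
print("gb18030", decodes_as(sample, "gb18030"))
```

The point of the sketch is only that a byte sequence can pass a
structural check and still have no mapping in the stricter character
set, which is the gap the proposed setting is meant to close.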

On Sat, May 9, 2026 at 4:58 PM Zhongpu Chen <chenloveit(at)gmail(dot)com> wrote:

> > If so, tightening up the validation may break such uses.
>
> I agree. How about introducing an extra GUC that lets users choose the
> validation logic? In fact, I have implemented this in a patch.
>
> ```
> SHOW encoding_validation;
> -- default behaviour
> SET encoding_validation = 'native';
> -- enforce Write to be fully compatible with Read
> SET encoding_validation = 'read_compatible';
> ```
>
> On Wed, May 6, 2026 at 8:19 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>
>> > It is in general not necessarily required that all text in all
>> > non-UTF8 encodings must be convertible to UTF8.
>> >
>> > (This is also a result of history: These encodings were implemented in
>> > PostgreSQL before Unicode.)
>> >
>> > That said, I can see how different behaviors might be desirable.
>> >
>> > My first question would be, are these non-convertible byte sequences
>> > just characters that don't map to Unicode, or are they invalid within
>> > the definition of the EUC-* encodings themselves?
>>
>> Strictly speaking, the former. 0xA2A3 is the lowercase Roman numeral
>> three (iii), which is not defined in the original GB2312 (the
>> character set of EUC_CN).
>>
>> > If the latter, then
>> > we should just reject them (modulo some backward compatibility),
>> > similar to how we reject certain Unicode code points that exist
>> > "structurally" but are not valid for one reason or another.
>>
>> After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
>> superset of GB2312.) In GB18030, lowercase forms of the Roman
>> numerals and other characters (e.g. the Euro sign) were added.
>>
>> So I suspect that a) those characters are sometimes used with EUC_CN
>> despite the fact that they are not valid GB2312 characters, and b) some
>> EUC_CN users might have already written those characters into EUC_CN
>> databases. If so, tightening up the validation may break such uses.
>> This is just my guess. Please correct me if I am wrong.
>>
>> > Alternatively, if these byte sequences are valid characters but they
>> > just didn't end up in Unicode for some reason, then rejecting them
>> > might break valid uses.
>>
>> That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
>> explicitly rejects characters that are not part of GB2312, including
>> 0xA2A3, since the script uses GB18030 as its source data.
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese: http://www.sraoss.co.jp
>>
>
>
> --
> Zhongpu Chen
>

--
Zhongpu Chen
