Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: peter(at)eisentraut(dot)org
Cc: chenloveit(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: 2026-05-06 12:19:07
Message-ID: 20260506.211907.1578384907621261702.ishii@postgresql.org
Lists: pgsql-hackers

> It is in general not necessarily required that all text in all
> non-UTF8 encodings must be convertible to UTF8.
>
> (This is also a result of history: These encodings were implemented in
> PostgreSQL before Unicode.)
>
> That said, I can see how different behaviors might be desirable.
>
> My first question would be, are these non-convertible byte sequences
> just characters that don't map to Unicode, or are they invalid within
> the definition of the EUC-* encodings themselves?

Strictly speaking, the latter. 0xA2A3 is the lowercase Roman numeral
three (iii, U+2172), which is not defined in the original GB2312
(the character set of EUC_CN).
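
To make the distinction concrete, here is a minimal Python sketch. It is
not PostgreSQL code. The structural check is a hypothetical
simplification (ASCII, or two bytes each in 0xA1..0xFE), and Python's
built-in "gb2312" codec is used only as a stand-in for "assigned in
GB2312"; its table is assumed, not guaranteed, to match PostgreSQL's.

```python
def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Hypothetical structural check: ASCII bytes, or byte pairs in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte ASCII character.
            i += 1
            continue
        if i + 1 >= len(data):
            # Truncated two-byte sequence.
            return False
        if not (0xA1 <= b <= 0xFE and 0xA1 <= data[i + 1] <= 0xFE):
            return False
        i += 2
    return True


def is_assigned_in_gb2312(data: bytes) -> bool:
    """Use Python's gb2312 codec as a proxy for 'defined in GB2312'."""
    try:
        data.decode("gb2312")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    seq = bytes([0xA2, 0xA3])
    print("structurally valid:", is_structurally_valid_euc_cn(seq))
    print("assigned in GB2312:", is_assigned_in_gb2312(seq))
```

With this sketch, 0xA2A3 passes the structural check but is not accepted
by the gb2312 codec, which is the gap being discussed.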

> If the latter, then
> we should just reject them (modulo some backward compatibility),
> similar to how we reject certain Unicode code points that exist
> "structurally" but are not valid for one reason or another.

GB18030 was defined after GB2312. (GB18030 is claimed to be a
superset of GB2312.) In GB18030, the lowercase Roman numerals and
other characters (e.g. the Euro sign) were added.
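
The superset relationship can be observed with Python's built-in codecs.
This is only an illustration under the assumption that Python's "gb2312"
and "gb18030" tables reflect the respective standards; decode_or_none is
a hypothetical helper, and the sample byte sequences are chosen for the
example.

```python
from typing import Optional


def decode_or_none(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


SAMPLES = {
    "lowercase Roman numeral three": bytes([0xA2, 0xA3]),
    "Euro sign (GB18030 position)": bytes([0xA2, 0xE3]),
    "CJK ideograph U+4E2D": bytes([0xD6, 0xD0]),
}

if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        for encoding in ("gb2312", "gb18030"):
            text = decode_or_none(raw, encoding)
            shown = "rejected" if text is None else f"U+{ord(text):04X}"
            print(f"{label:32} {raw.hex():6} {encoding:8} {shown}")
```

A character that exists in GB2312 decodes identically under both codecs,
while 0xA2A3 decodes only under gb18030.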

So I suspect that a) those characters are sometimes used with EUC_CN
despite the fact that they are not valid GB2312 characters, and b) some
EUC_CN users might have already written those characters into EUC_CN
databases. If so, tightening the validation may break such uses.
This is just my guess. Please correct me if I am wrong.

> Alternatively, if these byte sequences are valid characters but they
> just didn't end up in Unicode for some reason, then rejecting them
> might break valid uses.

That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
explicitly rejects characters that are not part of GB2312, including
0xA2A3, because the script uses GB18030 as its source data.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
