Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From: Zhongpu Chen <chenloveit(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: 2026-05-06 09:15:18
Message-ID: CA+1gyqK5FuJUG6rDRaYXZBcPd=Wn2fuy9Qhp5X_DT=J3a2HMAA@mail.gmail.com
Lists: pgsql-hackers

I agree that not every valid character in a legacy non-UTF8 encoding is
necessarily convertible to UTF8. But that premise assumes the byte
sequence actually denotes a valid character in the declared legacy
encoding.

For the reported EUC-CN cases, this is exactly the point in question. These
byte sequences are structurally well-formed EUC-CN byte pairs, but they
fall into reserved or unassigned positions of the GB2312 code table. For
example, byte sequences with first byte 0xAA correspond to row 10 of
GB2312, which is reserved/unassigned. Therefore, these cases are not merely
valid legacy characters that happen to lack Unicode mappings. Rather, under
strict GB2312/EUC-CN semantics, they are not assigned to any character at
all, and thus should not be considered valid GB2312 characters.
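The distinction can be illustrated with a short sketch (Python for brevity;
the byte-range predicate below mirrors the general shape of structural
EUC-CN validation, not PostgreSQL's exact verifier, and it relies on
Python's strict gb2312 codec as a stand-in for a mapping-table check):

```python
def structurally_valid_euc_cn(pair: bytes) -> bool:
    # Structural check only: a two-byte sequence whose bytes both fall
    # in the EUC-CN lead/trail range 0xA1..0xFE.
    return len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)

def assigned_in_gb2312(pair: bytes) -> bool:
    # Python's gb2312 codec is strict: it decodes only assigned
    # GB2312 code points and raises on reserved/unassigned positions.
    try:
        pair.decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False

# 0xB0A1 is an assigned character (the hanzi at row 16, cell 1);
# 0xAAA1 falls in row 10, which GB2312-1980 leaves reserved.
for pair in (b"\xb0\xa1", b"\xaa\xa1"):
    print(pair.hex(), structurally_valid_euc_cn(pair), assigned_in_gb2312(pair))
```

Both byte pairs pass the structural check, but only the first is an
assigned GB2312 character; that gap is exactly the set of values at issue.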

So my concern is not that every legacy-encoded character must be
convertible to UTF8. The concern is that PostgreSQL's write-time
validation accepts the full structural range of EUC-CN byte pairs as
text, even though some of those pairs are not assigned GB2312
characters, and PostgreSQL's own conversion path later cannot assign
character semantics to them.

Incidentally, as MySQL's implementation shows, a finer-grained checker is
feasible.
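One cheap tightening, short of a full mapping-table lookup, would be a
row-level check on the lead byte. A sketch (Python for illustration; the
real verifier would be C, and this deliberately does not catch unassigned
cells inside assigned rows, such as 0xA2A3, which only a mapping-table
check can reject):

```python
def gb2312_row_assigned(lead: int) -> bool:
    # GB2312-1980 assigns rows 1-9 (symbols, kana, Greek, Cyrillic, etc.)
    # and rows 16-87 (level-1 and level-2 hanzi); rows 10-15 and 88-94
    # are reserved. The row number is the lead byte minus 0xA0.
    row = lead - 0xA0
    return 1 <= row <= 9 or 16 <= row <= 87

# Lead byte 0xAA (row 10) is reserved; 0xB0 (row 16) is assigned.
print(gb2312_row_assigned(0xAA), gb2312_row_assigned(0xB0))
```

This would already reject the reported first-byte-0xAA sequences at input
time while leaving every assigned character untouched.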

On Wed, May 6, 2026 at 3:32 PM Peter Eisentraut <peter(at)eisentraut(dot)org>
wrote:

> On 02.05.26 04:31, Zhongpu Chen wrote:
> > See the related bug report https://www.postgresql.org/message-id/
> > CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com
> > <https://www.postgresql.org/message-id/
> > CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com>
> >
> > Currently PostgreSQL accepts structurally well-formed EUC_CN byte
> > sequences such as 0xA2A3 into text columns. The value round-trips when
> > client_encoding is EUC_CN, but fails when client_encoding is UTF8
> > because euc_cn_to_utf8 has no mapping.
> >
> > If this behavior is intentional for compatibility, the documentation
> > should explicitly say that validation for some legacy encodings is byte-
> > structure validation, not mapping-table validation.
> > If it is not intentional, stricter validation could reject unassigned
> > byte positions at input time.
>
> It is in general not necessarily required that all text in all non-UTF8
> encodings must be convertible to UTF8.
>
> (This is also a result of history: These encodings were implemented in
> PostgreSQL before Unicode.)
>
> That said, I can see how different behaviors might be desirable.
>
> My first question would be, are these non-convertible byte sequences
> just characters that don't map to Unicode, or are they invalid within
> the definition of the EUC-* encodings themselves? If the latter, then
> we should just reject them (modulo some backward compatibility), similar
> to how we reject certain Unicode code points that exist "structurally"
> but are not valid for one reason or another.
>
> Alternatively, if these byte sequences are valid characters but they
> just didn't end up in Unicode for some reason, then rejecting them might
> break valid uses.
>
> (I don't know much about EUC-* to be able to answer these.)
>
>

--
Zhongpu Chen
