Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From:	Zhongpu Chen <chenloveit(at)gmail(dot)com>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	peter(at)eisentraut(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date:	2026-05-09 08:58:09
Message-ID:	CA+1gyqJW8ht=GEoxARAL=8pUGbq7qw7VV4eP+g6PK9f+Qi_TXg@mail.gmail.com
Lists:	pgsql-hackers

> If so, tightening up the validation may break such uses.

I agree. What about introducing an extra GUC that lets users choose the
validation behavior? In fact, I have already implemented this as a patch.

```
SHOW encoding_validation;
-- default behaviour
SET encoding_validation = 'native';
-- enforce that written data is fully compatible with reads
SET encoding_validation = 'read_compatible';
```

On Wed, May 6, 2026 at 8:19 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:

> > It is in general not necessarily required that all text in all
> > non-UTF8 encodings must be convertible to UTF8.
> >
> > (This is also a result of history: These encodings were implemented in
> > PostgreSQL before Unicode.)
> >
> > That said, I can see how different behaviors might be desirable.
> >
> > My first question would be, are these non-convertible byte sequences
> > just characters that don't map to Unicode, or are they invalid within
> > the definition of the EUC-* encodings themselves?
>
> A strict answer is the former. 0xA2A3 is the lowercase form of the
> Roman numeral three (iii), which is not defined in the original GB2312
> (the character set of EUC_CN).
>
> > If the latter, then
> > we should just reject them (modulo some backward compatibility),
> > similar to how we reject certain Unicode code points that exist
> > "structurally" but are not valid for one reason or another.
>
> After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
> superset of GB2312.) In GB18030, lowercase forms of the Roman
> numerals and other characters (e.g. the Euro sign) were added.
>
> So I suspect that a) those characters are sometimes used with EUC_CN
> despite the fact that they are not valid GB2312 characters, and b) some
> EUC_CN users might have already written those characters into EUC_CN
> databases. If so, tightening up the validation may break such
> uses. This is just my guess. Please correct me if I am wrong.
>
> > Alternatively, if these byte sequences are valid characters but they
> > just didn't end up in Unicode for some reason, then rejecting them
> > might break valid uses.
>
> That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
> explicitly rejects characters that are not part of GB2312, including
> 0xA2A3, because the script uses GB18030 as its source data.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese: http://www.sraoss.co.jp
>
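
The point about 0xA2A3 in the quoted message can be checked outside PostgreSQL. The following is a minimal sketch that assumes Python's built-in `gb2312` and `gb18030` codecs are a reasonable stand-in for the two character sets; they are not PostgreSQL's conversion tables, and the helper `try_decode` is a hypothetical name used only for this illustration.

```python
# Illustration only: Python's codecs stand in for the character sets here.
# They are NOT the tables PostgreSQL uses for EUC_CN <-> UTF8 conversion.

raw = bytes([0xA2, 0xA3])  # the byte sequence discussed in the thread


def try_decode(data: bytes, codec: str):
    """Return the decoded string, or None if the codec has no mapping."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


in_gb2312 = try_decode(raw, "gb2312")    # original GB2312 repertoire
in_gb18030 = try_decode(raw, "gb18030")  # later standard with added characters

print("gb2312 :", repr(in_gb2312))
print("gb18030:", repr(in_gb18030))
```

Under that assumption, the bytes have no mapping in the GB2312 codec but decode to U+2172 (SMALL ROMAN NUMERAL THREE) in GB18030, which matches the description above: the sequence is structurally a valid two-byte EUC form, yet the character is not part of GB2312 itself.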

--
Zhongpu Chen
