Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From: Zhongpu Chen <chenloveit(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: 2026-05-02 02:31:12
Message-ID: CA+1gyqJJJDhq=cc_D0ad59WH_OD2G_mN54xTru0KYoNaLkF48Q@mail.gmail.com
Lists: pgsql-hackers

See the related bug report:
https://www.postgresql.org/message-id/CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com

Currently, PostgreSQL accepts structurally well-formed EUC_CN byte sequences
such as 0xA2A3 into text columns. Such a value round-trips when
client_encoding is EUC_CN, but reading it fails when client_encoding is UTF8
because euc_cn_to_utf8 has no mapping for that byte sequence.
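
To illustrate the distinction, here is a minimal sketch in Python rather than
PostgreSQL's C. It is not the server's code. The structural rule used below
(ASCII bytes, or two-byte sequences with both bytes in 0xA1..0xFE) is my
assumption about what "structurally well-formed" means for EUC_CN, the
function names are hypothetical, and CPython's "gb2312" codec stands in for
the euc_cn_to_utf8 mapping table, so its coverage may not match PostgreSQL's
exactly.

```python
def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Byte-structure check only (assumed rule, not PostgreSQL's code):
    each character is one ASCII byte, or two bytes both in 0xA1..0xFE."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Single-byte ASCII character.
            i += 1
            continue
        if i + 1 >= n:
            # Lead byte with no trail byte.
            return False
        trail = data[i + 1]
        if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
            return False
        i += 2
    return True


def is_convertible_to_utf8(data: bytes) -> bool:
    """Mapping-table check, using CPython's gb2312 codec as a stand-in
    for the euc_cn_to_utf8 conversion."""
    try:
        data.decode("gb2312").encode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in (b"\xd6\xd0", b"\xa2\xa3"):
        print(
            sample.hex(),
            "structural:", is_structurally_valid_euc_cn(sample),
            "convertible:", is_convertible_to_utf8(sample),
        )
```

With this sketch, 0xA2A3 passes the structural check but fails the mapping
check, which is the gap described above.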

If this behavior is intentional for compatibility, the documentation should
explicitly say that validation for some legacy encodings is byte-structure
validation, not mapping-table validation.
If it is not intentional, stricter validation could reject byte sequences
that have no assigned character at input time.

--
Zhongpu Chen
