| From: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
|---|---|
| To: | thomas(dot)munro(at)gmail(dot)com |
| Cc: | andreas(at)proxel(dot)se, pgsql-hackers(at)lists(dot)postgresql(dot)org, assam258(at)gmail(dot)com |
| Subject: | Re: Questionable description about character sets |
| Date: | 2026-04-17 01:28:24 |
| Message-ID: | 20260417.102824.927096962510122248.ishii@postgresql.org |
| Lists: | pgsql-hackers |
> If we wanted to follow the SQL standard's terminology, I think we'd
> call this the "character repertoire".
Calling it "character repertoire" works for me. Fortunately, the
meaning of "character repertoire" in the SQL standard and in other
standards (ISO/IEC 2022 or 10646) looks the same.
> In the standard, a "character
> set" is the database object representing a repertoire and an encoding
> of it, or its identifier.
Yes. Unlike ISO/IEC 2022 or 10646, the SQL standard makes no clear
distinction between a character set (in the sense of ISO/IEC 10646) and
an encoding. (To me this is quite confusing.)
> But if we put it in the description column,
> we wouldn't have to name it.
Why?
> Researching the standard led me to
> src/backend/catalog/information_schema.sql[1]. It currently reports
> the encoding name as the character set and the repertoire, except
> s/UTF8/UCS/ for the repertoire. That's the same information as you
> want to document here. For the character set (in the SQL standard
> sense), the current view definition seems reasonable given that we
> don't support CREATE CHARACTER SET or CHARACTER SET generally,
Why? For example, shouldn't EUC_JP have JIS X 0201, JIS X 0208 and JIS
X 0212 as its character repertoire?
> and for
> the character repertoire, the s/UTF8/UCS/ translation makes sense, but
> you chose to call it "Unicode". Shouldn't those agree?
I think "UCS" is not a repertoire, but a coded character set.
"Unicode" or "Unicode repertoire" [1] is more appropriate, I think.
[1] https://www.unicode.org/reports/tr17/tr17-3.html
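To make the difference concrete, here is a rough sketch in Python (not
the actual view definition) of the two mappings being compared. The
first function mimics what the quoted text says information_schema.sql
reports today: the encoding name, except UTF8 becomes UCS. The second is
only an illustration of the alternative, filled in for the two examples
named in this message. The function names and the dictionary are
hypothetical, and the repertoires for other encodings are not known yet.

```python
from __future__ import annotations


def current_repertoire(server_encoding: str) -> str:
    """Mimic the behaviour described in the thread: reuse the encoding
    name as the repertoire, except that UTF8 is reported as UCS."""
    return "UCS" if server_encoding == "UTF8" else server_encoding


# Hypothetical table: only the examples mentioned in this message.
# Entries for other encodings would need research.
PROPOSED_REPERTOIRES: dict[str, list[str]] = {
    "UTF8": ["Unicode"],
    "EUC_JP": ["JIS X 0201", "JIS X 0208", "JIS X 0212"],
}


def proposed_repertoire(server_encoding: str) -> list[str]:
    """Return the repertoire list for an encoding, falling back to the
    current behaviour for encodings that are not in the table."""
    fallback = [current_repertoire(server_encoding)]
    return PROPOSED_REPERTOIRES.get(server_encoding, fallback)


if __name__ == "__main__":
    for enc in ("UTF8", "EUC_JP", "LATIN1"):
        print(enc, current_repertoire(enc), proposed_repertoire(enc))
```

The fallback is there only so the sketch runs for any encoding name. It
is not a claim about what the documentation should say for them.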
> If GB18030 were a valid server encoding, it would surely have to
> report UCS, like UTF8, since it is also a "Unicode transformation
> format"[2] (its purpose is to be backwards compatible with legacy
> 2-byte-per-common-Chinese-character formats while also covering all of
> Unicode 100% systematically, ie booting stuff they don't often encode
> into the 3- and 4-byte zone to make room for efficient encoding of
> stuff they do often encode). So I think that means your new
> documentation should say UCS (or UNICODE) for that one too.
Not sure. I heard that the latest GB18030 (GB18030-2022, at this
point) does not contain some newer Unicode characters.
> I don't
> know how other encodings should spell their repertoire though...
I need to research that too.
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp