Re: Questionable description about character sets

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: thomas(dot)munro(at)gmail(dot)com
Cc: andreas(at)proxel(dot)se, pgsql-hackers(at)lists(dot)postgresql(dot)org, assam258(at)gmail(dot)com
Subject: Re: Questionable description about character sets
Date: 2026-04-17 01:28:24
Message-ID: 20260417.102824.927096962510122248.ishii@postgresql.org
Lists: pgsql-hackers

> If we wanted to follow the SQL standard's terminology, I think we'd
> call this the "character repertoire".

Calling it "character repertoire" works for me. Fortunately, the
meaning of "character repertoire" in the SQL standard and in other
standards (ISO/IEC 2022 or 10646) looks the same.

> In the standard, a "character
> set" is the database object representing a repertoire and an encoding
> of it, or its identifier.

Yes. Unlike ISO/IEC 2022 or 10646, the SQL standard has no clear
distinction between character set (in the sense of ISO/IEC 10646) and
encoding. (To me this is quite confusing.)

> But if we put it in the description column,
> we wouldn't have to name it.

Why?

> Researching the standard led me to
> src/backend/catalog/information_schema.sql[1]. It currently reports
> the encoding name as the character set and the repertoire, except
> s/UTF8/UCS/ for the repertoire. That's the same information as you
> want to document here. For the character set (in the SQL standard
> sense), the current view definition seems reasonable given that we
> don't support CREATE CHARACTER SET or CHARACTER SET generally,

Why? For example, shouldn't EUC_JP have JIS X 0201, JIS X 0208 and JIS
X 0212 as its character repertoire?
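
To make the point concrete, here is a rough sketch of how one EUC_JP
string can draw on several coded character sets. It uses Python's
built-in "euc_jp" codec rather than PostgreSQL's own conversion code,
and the sample characters are my own picks, so treat it only as an
illustration:

```python
# Illustration only: Python's "euc_jp" codec, not PostgreSQL's EUC_JP code.
# The sample characters below are arbitrary picks for this sketch.
samples = {
    "ASCII": "A",
    "JIS X 0201 katakana": "\uff71",  # halfwidth katakana A
    "JIS X 0208": "\u3042",           # hiragana A
    "JIS X 0212": "\u4e02",           # a kanji that is not in JIS X 0208
}

# Encode each sample on its own so the byte layout per set is visible.
encoded = {name: ch.encode("euc_jp") for name, ch in samples.items()}

for name, raw in encoded.items():
    print(f"{name:20} {raw.hex(' ')}")
```

In this codec, the JIS X 0201 katakana is prefixed with the single
shift byte 0x8E and the JIS X 0212 character with 0x8F, while the JIS
X 0208 character needs no prefix.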

> and for
> the character repertoire, the s/UTF8/UCS/ translation makes sense, but
> you chose to call it "Unicode". Shouldn't those agree?

I think "UCS" is not a repertoire, but a coded character set.
"Unicode" or "Unicode repertoire" [1] is more appropriate, I think.

[1] https://www.unicode.org/reports/tr17/tr17-3.html
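
For reference, the rule described in the quoted text (report the
encoding name as the repertoire, except s/UTF8/UCS/) amounts to
something like the sketch below. The function name is hypothetical and
this is Python, not the actual view definition:

```python
def character_repertoire(encoding_name: str) -> str:
    """Sketch of the quoted rule: UTF8 is reported as UCS, all else as-is.

    Hypothetical helper for illustration; not taken from PostgreSQL.
    """
    return "UCS" if encoding_name == "UTF8" else encoding_name


for enc in ("UTF8", "EUC_JP", "LATIN1"):
    print(enc, "->", character_repertoire(enc))
```

Under this rule EUC_JP reports itself as its own repertoire, which is
the part being questioned in this thread.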

> If GB18030 were a valid server encoding, it would surely have to
> report UCS, like UTF8, since it is also a "Unicode transformation
> format"[2] (its purpose is to be backwards compatible with legacy
> 2-byte-per-common-Chinese-character formats while also covering all of
> Unicode 100% systematically, ie booting stuff they don't often encode
> into the 3- and 4-byte zone to make room for efficient encoding of
> stuff they do often encode). So I think that means your new
> documentation should say UCS (or UNICODE) for that one too.

Not sure. I heard that the latest GB18030 (GB18030-2022, at this
point) does not contain some newer Unicode characters.
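
As a small illustration of the layout described in the quoted text
(short sequences for common Chinese characters, longer ones for the
rest of Unicode), here is a sketch using Python's built-in "gb18030"
codec. I am assuming nothing about whether that codec matches
GB18030-2022, so it does not settle the question of newer characters:

```python
# Illustration only: Python's "gb18030" codec may not match GB18030-2022.
# Sample characters are arbitrary picks: a common Han character, an
# Arabic letter, and an emoji outside the Basic Multilingual Plane.
samples = ["\u4e2d", "\u0627", "\U0001f600"]

lengths = {}
for ch in samples:
    raw = ch.encode("gb18030")
    # Check that this codec round-trips the character.
    assert raw.decode("gb18030") == ch
    lengths[ch] = len(raw)
    print(f"U+{ord(ch):04X} -> {raw.hex(' ')} ({len(raw)} bytes)")
```

With this codec the Han character takes two bytes and the other two
take four bytes each.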

> I don't
> know how other encodings should spell their repertoire though...

I need to research that too.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
