Re: Questionable description about character sets

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: andreas(at)proxel(dot)se, pgsql-hackers(at)lists(dot)postgresql(dot)org, Henson Choi <assam258(at)gmail(dot)com>
Subject: Re: Questionable description about character sets
Date: 2026-04-15 09:26:43
Message-ID: CA+hUKGJLCs7+8sW8ufY8WmiZzRhK+wtMEpe1-tJ6oyy2YEAQQg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 16, 2026 at 5:35 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Sat, Feb 14, 2026 at 11:20 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> > > Wouldn't that make the table very wide?
> >
> > I don't think it would make the table very wide but a little bit
> > wider. So I think adding the character sets information to
> > "Description" column is better. Some of encodings already have the
> > info. See attached patch.

If we wanted to follow the SQL standard's terminology, I think we'd
call this the "character repertoire". In the standard, a "character
set" is the database object representing a repertoire and an encoding
of it, or its identifier. But if we put it in the description column,
we wouldn't have to name it.

Researching the standard led me to
src/backend/catalog/information_schema.sql[1]. It currently reports
the encoding name as the character set and the repertoire, except
s/UTF8/UCS/ for the repertoire. That's the same information as you
want to document here. For the character set (in the SQL standard
sense), the current view definition seems reasonable given that we
don't support CREATE CHARACTER SET or CHARACTER SET generally, and for
the character repertoire, the s/UTF8/UCS/ translation makes sense, but
you chose to call it "Unicode". Shouldn't those agree?
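
For illustration only, the reporting rule described above (encoding name
used as both character set and repertoire, except UTF8 reported as UCS)
can be restated as a small Python sketch. The function names are
hypothetical and are not taken from information_schema.sql; the view
itself is written in SQL.

```python
def character_set_name(server_encoding: str) -> str:
    """Hypothetical helper: the character set is reported as the encoding name."""
    return server_encoding


def character_repertoire(server_encoding: str) -> str:
    """Hypothetical helper: same as the encoding name, except UTF8 -> UCS."""
    return "UCS" if server_encoding == "UTF8" else server_encoding


if __name__ == "__main__":
    # Encoding names here are just examples of valid server encodings.
    for enc in ("UTF8", "EUC_JP", "LATIN1"):
        print(enc, character_set_name(enc), character_repertoire(enc))
```

Under this rule only UTF8 gets a repertoire name that differs from its
encoding name, which is why the spelling chosen in the documentation
("Unicode" vs. "UCS") matters.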

If GB18030 were a valid server encoding, it would surely have to
report UCS, like UTF8, since it is also a "Unicode transformation
format"[2]. (Its purpose is to be backwards compatible with legacy
2-byte-per-common-Chinese-character formats while also covering all of
Unicode systematically, i.e. pushing characters that are rarely needed
into the 4-byte zone to make room for efficient encoding of the
characters that are needed often.) So I think that means your new
documentation should say UCS (or UNICODE) for that one too. I don't
know how other encodings should spell their repertoire though...
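
A minimal sketch of the GB18030 point, assuming Python's built-in
"gb18030" codec is an acceptable stand-in for the encoding (it is not
PostgreSQL's conversion code). It encodes a few sample characters and
shows that legacy characters stay at 1 or 2 bytes while other Unicode
code points, including one outside the BMP, land in 4-byte sequences.
The sample characters and the helper name are my own choices.

```python
def gb18030_bytes(ch: str) -> bytes:
    """Hypothetical helper: encode one character with Python's gb18030 codec."""
    return ch.encode("gb18030")


# A few sample characters; the selection is only illustrative.
SAMPLES = {
    "A": "ASCII letter",
    "\u4e2d": "common Chinese character (U+4E2D)",
    "\u0080": "C1 control character (U+0080)",
    "\U00010000": "first code point beyond the BMP (U+10000)",
}

if __name__ == "__main__":
    for ch, label in SAMPLES.items():
        raw = gb18030_bytes(ch)
        # Decoding must give back the original character.
        assert raw.decode("gb18030") == ch
        print(f"U+{ord(ch):04X} {label}: {len(raw)} byte(s) {raw.hex()}")
```

The same characters also round-trip through UTF-8, which is the sense in
which both encodings would share one repertoire.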

(CC Henson Choi who might be interested in this topic especially WRT Korean.)

[1] https://www.postgresql.org/docs/current/infoschema-character-sets.html
[2] https://en.wikipedia.org/wiki/GB_18030
