Re: Do we still need MULE_INTERNAL?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Do we still need MULE_INTERNAL?
Date: 2026-02-11 14:06:29
Message-ID: CA+hUKGK4ZvZYNRC_W10dT2W6TYBY24q=B-EfKpUL50v2E3U6_w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 11, 2026 at 7:52 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> Thank you for the report. I find it is quite useful, especially the
> Emacs 23 internal (new to me). I agree that MULE_INTERNAL has
> fulfilled its historic role.

Thanks Ishii-san and Tom. Here's a patch. It mostly just deletes
thousands of lines, but two things needed some thought. I had to
preserve the encoding number, so there's now a hole in the table. I
also had to think of a new name for cyrillic_and_mic.c; I went with
cyrillic.c because it handles 4 single-byte encodings and it wasn't
clear how to fit it into the existing x_and_y pattern (i.e. which two
to highlight arbitrarily in the name).
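
As a rough illustration of why the slot stays reserved: encoding IDs are
stored on disk (for example in pg_database.encoding), so renumbering the
entries after the removed one would change what existing catalogs mean.
The sketch below is not taken from the patch; the table contents, the
IDs, and the encoding_name() helper are hypothetical.

```python
# Hypothetical encoding table; names and IDs are for illustration only
# and are not copied from the patch or from the PostgreSQL headers.
ENCODINGS = {
    5: "EUC_JIS_2004",
    6: "UTF8",
    7: None,  # hole: ID kept reserved for the removed encoding
    8: "LATIN1",
}


def encoding_name(enc_id: int) -> str:
    """Return the name for enc_id, rejecting unknown and retired IDs."""
    name = ENCODINGS.get(enc_id)
    if name is None:
        raise ValueError(f"invalid or retired encoding ID: {enc_id}")
    return name


# IDs after the hole keep their old values, so stored IDs stay valid.
print(encoding_name(8))
```

Leaving the hole means a lookup of the retired ID fails cleanly instead
of silently resolving to a different encoding.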

> > Since there are two encodings for kana characters and MULE's
> > superpower is to switch, I guess it depends how you chose to encode it
> > and what your ratio of kana to kanji is.
>
> The reason for 2 encodings in MULE for "kana" exist is, it's a nature
> of the character sets mule supports. In Japanese there are 2 types of
> "kana", one is "hiragana" and the other is "katakana". JIS X0208/0212
> includes both types of "kana", while JIS X0201 includes only
> "katakana". So why "katakana" appears on those two encodings? Katakana
> in JIS X0201 is often rendered on screen in half width comparing with
> JIS X 0208 and 0212. Some users find this beneficial.

Ah, right, I see. And judging by Wikipedia's article on half-width
katakana, it sounds like text that mixes katakana with hiragana and
kanji would probably not use the half-width forms anyway, so perhaps 3
is a better guess. In other words, MULE_INTERNAL databases would
probably not get bigger if reloaded as UTF-8.
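
For the UTF-8 side of that comparison, a quick check along these lines
shows the per-character size for each script. This is only a sketch: the
sample characters are arbitrary picks, not taken from the thread.

```python
# One sample character per script (arbitrary choices for illustration).
samples = {
    "hiragana": "\u3042",             # HIRAGANA LETTER A
    "katakana": "\u30a2",             # KATAKANA LETTER A
    "half-width katakana": "\uff71",  # HALFWIDTH KATAKANA LETTER A
    "kanji": "\u6f22",                # CJK ideograph U+6F22
}

# Measure how many bytes each character occupies when encoded as UTF-8.
utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, size in utf8_sizes.items():
    print(f"{name}: {size} bytes")
```

All four come out at 3 bytes, including the half-width forms, so in
UTF-8 the choice between half-width and full-width katakana makes no
difference to storage size.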

> > UTF8: 3 3
>
> I thought some of JIS 2004 kanji are mapped to 4-byte UTF8 character.

Looks like it:

grep 'U+[0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F].*\[200[04]\]' \
./src/backend/utils/mb/Unicode/euc-jis-2004-std.txt

They are in "CJK Unified Ideographs Extension B" for "rare and
historic CJK ideographs", so I guess they wouldn't matter much, but in
any case we're talking about a hypothetical user moving from
MULE_INTERNAL, which *doesn't* have JIS 2004. I think the older
standards are entirely in the Basic Multilingual Plane, so they need
only 1- to 3-byte UTF-8 sequences.
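
The plane boundary is what decides the sequence length. A minimal
sketch, with utf8_len() as a hypothetical helper name, cross-checked
against Python's own encoder:

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for the code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes, e.g. Extension B


# Sample code points: ASCII, Latin-1, kana, kanji, and the first
# code point of CJK Unified Ideographs Extension B (U+20000).
for cp in (0x41, 0xE9, 0x3042, 0x4E00, 0x20000):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_len(cp)} bytes")
```

Anything at or above U+10000, which includes all of Extension B
(U+20000 to U+2A6DF), takes 4 bytes.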

. o O ( UTF-16 would probably be the ideal storage for CJK text if we
could do it... )
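
A rough way to see the trade-off behind that aside; the sample strings
are arbitrary and sizes() is a hypothetical helper, not an existing API:

```python
def sizes(text: str) -> dict:
    """Encoded size of text in bytes, per encoding (UTF-16 without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


cjk_text = "日本語のテキスト"  # 8 characters, all in the BMP
ascii_text = "SELECT 1"        # 8 ASCII characters

print("CJK:  ", sizes(cjk_text))
print("ASCII:", sizes(ascii_text))
```

BMP CJK text shrinks from 3 to 2 bytes per character under UTF-16, while
ASCII text doubles from 1 to 2, so any gain depends on how much of the
stored text is actually CJK.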

Attachment: v1-0001-Remove-MULE_INTERNAL-encoding.patch (application/x-patch, 154.1 KB)
