| From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
|---|---|
| To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Subject: | Re: Do we still need MULE_INTERNAL? |
| Date: | 2026-02-11 14:06:29 |
| Message-ID: | CA+hUKGK4ZvZYNRC_W10dT2W6TYBY24q=B-EfKpUL50v2E3U6_w@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Feb 11, 2026 at 7:52 PM Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> Thank you for the report. I find it is quite useful, especially the
> Emacs 23 internal (new to me). I agree that MULE_INTERNAL has
> fulfilled its historic role.
Thanks Ishii-san and Tom. Here's a patch. Obviously it mostly just
deletes thousands of lines, but also: I had to preserve the encoding
number, so there's a hole in the table, and I had to think of a new
name for cyrillic_and_mic.c. I went with cyrillic.c because it
handles 4 single-byte encodings and it wasn't clear how to fit that into
the existing x_and_y pattern (i.e. which two to highlight arbitrarily in
the name).
> > Since there are two encodings for kana characters and MULE's
> > superpower is to switch, I guess it depends how you chose to encode it
> > and what your ratio of kana to kanji is.
>
> The reason for 2 encodings in MULE for "kana" exist is, it's a nature
> of the character sets mule supports. In Japanese there are 2 types of
> "kana", one is "hiragana" and the other is "katakana". JIS X0208/0212
> includes both types of "kana", while JIS X0201 includes only
> "katakana". So why "katakana" appears on those two encodings? Katakana
> in JIS X0201 is often rendered on screen in half width comparing with
> JIS X 0208 and 0212. Some users find this beneficial.
Ah, right, I see. And judging by Wikipedia's article on half-width
katakana, it sounds like text that mixes katakana with hiragana and
kanji would probably not use the half-width forms anyway, so perhaps 3
is a better guess. In other words, MULE_INTERNAL databases would
probably not get bigger if reloaded as UTF-8.
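
The UTF-8 side of that estimate can be checked by encoding one character of each kind and counting bytes. This is only an illustrative sketch in Python, not part of the patch; the `utf8_len` helper and the sample characters are assumptions chosen for the example.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# One representative character per script class (illustrative choices).
samples = {
    "hiragana": "\u3042",             # HIRAGANA LETTER A
    "full-width katakana": "\u30a2",  # KATAKANA LETTER A
    "half-width katakana": "\uff71",  # HALFWIDTH KATAKANA LETTER A
    "kanji": "\u6f22",                # CJK ideograph U+6F22
}

for name, ch in samples.items():
    print(f"{name}: {utf8_len(ch)} bytes")
```

Each of these samples takes 3 bytes in UTF-8, including the half-width katakana.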
> > UTF8: 3 3
>
> I thought some of JIS 2004 kanji are mapped to 4-byte UTF8 character.
Looks like it:
    grep 'U+[0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F].*\[200[04]\]' \
        ./src/backend/utils/mb/Unicode/euc-jis-2004-std.txt
They are in "CJK Unified Ideographs Extension B" for "rare and
historic CJK ideographs", so I guess they wouldn't matter much, but in
any case we're talking about a hypothetical user moving from
MULE_INTERNAL, which *doesn't* have JIS 2004. I think the older
standards are entirely in the basic plane, so only 1-3-byte UTF-8
sequences.
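
To make the plane argument concrete, here is a small Python sketch of how the UTF-8 sequence length follows from the code point range. The function name `utf8_sequence_length` and the two sample code points are assumptions for illustration; the thresholds are the standard UTF-8 boundaries, and the loop cross-checks them against Python's own encoder.

```python
def utf8_sequence_length(code_point: int) -> int:
    """Return the UTF-8 sequence length for a Unicode scalar value."""
    if code_point < 0x80:
        return 1          # ASCII
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3          # rest of the Basic Multilingual Plane
    return 4              # supplementary planes, e.g. CJK Extension B


# U+4E00 is in the basic plane; U+20000 is the first code point of
# CJK Unified Ideographs Extension B.
for cp in (0x4E00, 0x20000):
    encoded = chr(cp).encode("utf-8")
    assert len(encoded) == utf8_sequence_length(cp)
    print(f"U+{cp:04X}: {len(encoded)} bytes")
```

Anything below U+10000 fits in at most 3 bytes, which is why only the Extension B characters found by the grep above would need 4.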
. o O ( UTF-16 would probably be the ideal storage for CJK text if we
could do it... )
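
As a rough illustration of that aside, the following Python sketch compares encoded sizes of the same text in UTF-8 and UTF-16. The `encoded_sizes` helper and the sample strings are assumptions for the example, and it says nothing about how PostgreSQL could actually store such data.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Three basic-plane kanji: U+65E5, U+672C, U+8A9E.
utf8_size, utf16_size = encoded_sizes("\u65e5\u672c\u8a9e")
print(f"UTF-8: {utf8_size} bytes, UTF-16: {utf16_size} bytes")

# ASCII text goes the other way: 1 byte per character in UTF-8, 2 in UTF-16.
print(encoded_sizes("abc"))
```

Basic-plane CJK characters take 2 bytes each in UTF-16 versus 3 in UTF-8, while ASCII-heavy text doubles in size, so the benefit depends on the mix.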
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Remove-MULE_INTERNAL-encoding.patch | application/x-patch | 154.1 KB |