
Re: Patch: add conversion from pg_wchar to multibyte

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-09 04:15:46
Message-ID:	20120709.131546.2272132227508407100.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

>>> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>> that I can find.
>>>>>
>>>>> I also read in the xemacs internals doc, at
>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>> charset codes. I don't know how we got to the current state of using
>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>> with XEmacs.
>>>
>>>> At the time when mule internal code was introduced to PostgreSQL,
>>>> xemacs did not have multi-encoding capability and mule (a patch to
>>>> emacs) was the only implementation that could use multiple encodings. So I
>>>> used the specification of mule documented in the URL I wrote.
>>>
>>> I see. Given that upstream has decided that a simpler definition is
>>> more appropriate, is there any reason not to follow their lead, to the
>>> extent that we can do so without breaking existing on-disk data?
>>
>> Please let me spend the weekend to understand their latest spec.
>
> This is an intermediate report on the internal multi-byte charset
> implementation of emacsen. I have read the link Tom showed. I also made
> a quick scan of the xemacs-21.4.0 source code, especially
> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
> essentially a copy of the comments in that file. I also looked into
> other places in the xemacs code, and I think I can conclude that xemacs
> 21.4's multi-byte implementation is based on the doc on the web.
>
> Next I looked into the emacs 24.1 source code, because I could not find any
> doc regarding emacs's (not xemacs's) implementation of the internal
> multi-byte charset. I found the following in src/charset.h:
>
> /* Leading-code followed by extended leading-code. DIMENSION/COLUMN */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_11 0x9A /* 1/1 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_12 0x9B /* 1/2 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_21 0x9C /* 2/2 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_22 0x9D /* 2/2 */
>
> And these are used like this:
>
> /* Read one non-ASCII character from INSTREAM.  The character is
>    encoded in `emacs-mule' and the first byte is already read in
>    C.  */
>
> static int
> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
> {
>   :
>   :
>   else if (len == 3)
>     {
>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>         {
>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>           code = buf[2] & 0x7F;
>         }
>
> As far as I can tell, this is exactly how PostgreSQL handles
> private single-byte character sets: they consist of 3 bytes, and
> the leading byte is either 0x9a or 0x9b. Other examples regarding
> single-byte/multi-byte private charsets can be seen in coding.c.
>
> As far as I can tell, emacs and xemacs employ different
> implementations of multi-byte charsets with regard to "private"
> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
> I am contacting the original Mule author to ask if he knows anything
> about this.

I got a reply from the Mule author, Kenichi Handa (the mail is in
Japanese, so I do not quote it here. If somebody wants to read
the original mail, please let me know). First of all, my understanding
of emacs's implementation is correct according to him. He did not
know about xemacs's implementation. Apparently the implementation in
xemacs was not led by the original Mule author.

So which one of emacs/xemacs should we follow? My suggestion is not
to follow xemacs, and to leave the current treatment of private
leading bytes as it is, because emacs seems to be the more "right"
upstream compared with xemacs.
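
For reference, the private leading byte layout discussed above can be
sketched in C roughly as follows. This is only an illustration, not code
from PostgreSQL or emacs: the macro and function names are made up for
this example, and the 4-byte length for the 0x9c/0x9d case is my
assumption (marker, extended leading code, then two code bytes); only the
3-byte case is spelled out in the quoted emacs code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Private leading codes, values as quoted from emacs src/charset.h.
 * The macro names here are hypothetical. */
#define LC_PRIVATE_11 0x9A
#define LC_PRIVATE_12 0x9B
#define LC_PRIVATE_21 0x9C
#define LC_PRIVATE_22 0x9D

/* True if c marks a private single-byte (dimension 1) charset. */
bool is_private_dim1(unsigned char c)
{
    return c == LC_PRIVATE_11 || c == LC_PRIVATE_12;
}

/* True if c marks a private multi-byte (dimension 2) charset. */
bool is_private_dim2(unsigned char c)
{
    return c == LC_PRIVATE_21 || c == LC_PRIVATE_22;
}

/* Total encoded length implied by a private leading byte,
 * or 0 if the byte is not one of the four private markers.
 * The value 4 for dimension 2 is an assumption (see above). */
int private_char_len(unsigned char lead)
{
    if (is_private_dim1(lead))
        return 3;
    if (is_private_dim2(lead))
        return 4;
    return 0;
}

/* Split a 3-byte private single-byte character into its parts:
 * buf[0] = private marker, buf[1] = extended leading code that
 * identifies the charset, buf[2] = code with the high bit set.
 * Returns false if the input is too short or not a dimension 1 marker. */
bool decode_private_dim1(const unsigned char *buf, size_t len,
                         unsigned char *charset_id, unsigned char *code)
{
    if (len < 3 || !is_private_dim1(buf[0]))
        return false;
    *charset_id = buf[1];
    *code = buf[2] & 0x7F;   /* same masking as in read_emacs_mule_char */
    return true;
}
```

With this sketch, the byte sequence 0x9a 0xa1 0xc1 would be treated as a
3-byte character in the private charset identified by 0xa1, with code
0x41. Under the xemacs scheme Tom described, 0x9a would not be a private
marker at all, which is why the two cannot both be followed.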

> BTW, while looking into emacs's source code, I found that their charset
> definitions are in lisp/international/mule-conf.el. According to the
> file, several new charsets have been added. Included is a patch to
> follow their changes. This makes no change to current behavior, since
> the patch only changes some comments and unsupported charsets.

If there's no objection, I would like to commit this. Objection?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

In response to

Re: Patch: add conversion from pg_wchar to multibyte at 2012-07-08 02:10:57 from Tatsuo Ishii

Responses

Re: Patch: add conversion from pg_wchar to multibyte at 2012-07-10 23:23:26 from Tatsuo Ishii
