Re: Patch: add conversion from pg_wchar to multibyte

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	ishii(at)postgresql(dot)org
Cc:	tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-10 23:23:26
Message-ID:	20120711.082326.1199398009192084540.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

>>>> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>>> that I can find.
>>>>>>
>>>>>> I also read in the xemacs internals doc, at
>>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>>> charset codes. I don't know how we got to the current state of using
>>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>>> with XEmacs.
>>>>
>>>>> At the time when mule internal code was introduced to PostgreSQL,
>>>>> xemacs did not have multi-encoding capability, and mule (a patch to
>>>>> emacs) was the only implementation that could use multiple encodings.
>>>>> So I used the specification of mule documented in the URL I wrote.
>>>>
>>>> I see. Given that upstream has decided that a simpler definition is
>>>> more appropriate, is there any reason not to follow their lead, to the
>>>> extent that we can do so without breaking existing on-disk data?
>>>
>>> Please let me spend the weekend understanding their latest spec.
>>
>> This is an intermediate report on the internal multi-byte charset
>> implementation of emacsen. I have read the link Tom showed. I also made
>> a quick scan of the xemacs-21.4.0 source code, especially
>> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
>> essentially a copy of the comments in that file. I also looked into
>> other places in the xemacs code, and I think I can conclude that xemacs
>> 21.4's multi-byte implementation is based on the doc on the web.
>>
>> Next I looked into the emacs 24.1 source code, because I could not find
>> any doc regarding emacs's (not xemacs's) implementation of the internal
>> multi-byte charset. I found the following in src/charset.h:
>>
>> /* Leading-code followed by extended leading-code. DIMENSION/COLUMN */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_11 0x9A /* 1/1 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_12 0x9B /* 1/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_21 0x9C /* 2/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_22 0x9D /* 2/2 */
>>
>> And these are used like this:
>>
>> /* Read one non-ASCII character from INSTREAM. The character is
>>    encoded in `emacs-mule' and the first byte is already read in
>>    C. */
>>
>> static int
>> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
>> {
>>   :
>>   :
>>   else if (len == 3)
>>     {
>>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>>         {
>>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>>           code = buf[2] & 0x7F;
>>         }
>>
>>
>> As far as I can tell, this is exactly how PostgreSQL handles
>> single-byte private character sets: each character consists of 3 bytes,
>> and the leading byte is either 0x9a or 0x9b. Other examples regarding
>> single-byte/multi-byte private charsets can be seen in coding.c.
>>
>> As far as I can tell, emacs and xemacs employ different
>> implementations of multi-byte charsets regarding "private"
>> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
>> I am contacting the original Mule author to ask if he knows anything
>> about this.
>
> I got a reply from the Mule author, Kenichi Handa (the mail is in
> Japanese, so I do not quote it here. If somebody wants to read
> the original mail, please let me know). First of all, my understanding
> of emacs's implementation is correct according to him. He did not
> know about xemacs's implementation. Apparently the implementation in
> xemacs was not led by the original mule author.
>
> So which one of emacs/xemacs should we follow? My suggestion is not
> to follow xemacs, and to leave the current treatment of the private
> leading byte as it is, because emacs seems to be the more "right"
> upstream compared with xemacs.
>
>> BTW, while looking into emacs's source code, I found that their charset
>> definitions are in lisp/international/mule-conf.el. According to the
>> file, several new charsets have been added. Included is a patch to
>> follow their changes. This makes no change to current behavior, since
>> the patch just changes some comments and unsupported charsets.
>
> If there's no objection, I would like to commit this. Objection?

Done, along with a comment noting that we follow emacs's implementation,
not xemacs's.
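
For anyone following along, the 3-byte layout discussed above (leading byte 0x9a or 0x9b, then the extended leading byte, then the code byte) can be sketched in C roughly as follows. This is only an illustration under my own assumptions: the names private_char and decode_private_single are hypothetical and are neither PostgreSQL nor emacs code. Only the marker values and the `& 0x7F` masking are taken from the emacs excerpt quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Marker bytes, as quoted from emacs 24.1 src/charset.h. */
#define LC_PRIVATE_11 0x9A
#define LC_PRIVATE_12 0x9B

/* Hypothetical result type, for illustration only. */
typedef struct
{
    unsigned char charset_id; /* extended leading byte, buf[1] */
    unsigned char code;       /* buf[2] with the high bit cleared */
} private_char;

/*
 * Hypothetical helper: decode one 3-byte character belonging to a
 * single-byte private charset.  Returns 1 on success, or 0 if the
 * input is too short or does not start with 0x9a/0x9b.
 */
static int
decode_private_single(const unsigned char *buf, size_t len, private_char *out)
{
    assert(buf != NULL && out != NULL);

    if (len < 3)
        return 0;
    if (buf[0] != LC_PRIVATE_11 && buf[0] != LC_PRIVATE_12)
        return 0;

    out->charset_id = buf[1];
    out->code = buf[2] & 0x7F;
    return 1;
}
```

With the bytes 0x9a 0xa0 0xc1 as input, this sketch would report charset id 0xa0 and code 0x41; a sequence starting with 0x9c or 0x9d is rejected because those markers introduce the multi-byte private charsets instead.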
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

In response to

Re: Patch: add conversion from pg_wchar to multibyte at 2012-07-09 04:15:46 from Tatsuo Ishii

Responses

Re: Patch: add conversion from pg_wchar to multibyte at 2012-07-11 05:07:11 from Tom Lane
