Re: Patch: add conversion from pg_wchar to multibyte

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: robertmhaas(at)gmail(dot)com
Cc: aekorotkov(at)gmail(dot)com, ishii(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: add conversion from pg_wchar to multibyte
Date: 2012-07-02 23:33:36
Message-ID: 20120703.083336.1290159206305528932.t-ishii@sraoss.co.jp
Lists: pgsql-hackers
> Yeah, I did.  I think I may be a bit confused here, so let me try to
> understand this a bit better.  It seems like pg_mule2wchar_with_len
> uses the following algorithm:
> 
> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
> with shifts of 16 and 0.
> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
> skipping the first one and storing the remaining two with shifts of 16
> and 0.
> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
> stored with shifts of 16, 8, and 0.
> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
> skipping the first one and storing the remaining three with offsets of
> 16, 8, and 0.

Correct.
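
To make the four rules quoted above concrete, here is a minimal sketch in C. It is not the PostgreSQL function itself: `sketch_mule_to_wchar` is a hypothetical helper that decodes a single character, the lead-byte ranges are copied from the quoted text, and it assumes the caller guarantees that enough input bytes are available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lead-byte classes, using the ranges given in the quoted text. */
#define IS_LC1(c)    ((c) >= 0x81 && (c) <= 0x8d)
#define IS_LCPRV1(c) ((c) == 0x9a || (c) == 0x9b)
#define IS_LC2(c)    ((c) >= 0x90 && (c) <= 0x99)
#define IS_LCPRV2(c) ((c) == 0x9c || (c) == 0x9d)

/*
 * Hypothetical helper (not PostgreSQL's code): decode one MULE character
 * starting at s into *out and return the number of input bytes consumed.
 * Assumption: the caller has already checked that enough bytes remain.
 */
static size_t
sketch_mule_to_wchar(const unsigned char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);

    if (IS_LC1(s[0])) {
        /* two bytes, stored with shifts of 16 and 0 */
        *out = ((uint32_t) s[0] << 16) | s[1];
        return 2;
    }
    if (IS_LCPRV1(s[0])) {
        /* three bytes; the prefix byte is skipped */
        *out = ((uint32_t) s[1] << 16) | s[2];
        return 3;
    }
    if (IS_LC2(s[0])) {
        /* three bytes, stored with shifts of 16, 8, and 0 */
        *out = ((uint32_t) s[0] << 16) | ((uint32_t) s[1] << 8) | s[2];
        return 3;
    }
    if (IS_LCPRV2(s[0])) {
        /* four bytes; the prefix byte is skipped */
        *out = ((uint32_t) s[1] << 16) | ((uint32_t) s[2] << 8) | s[3];
        return 4;
    }
    /* anything else is treated as a single byte (ASCII) */
    *out = s[0];
    return 1;
}
```

Note that in both private cases the prefix byte (0x9a-0x9d) is dropped, so the wide character alone does not record which prefix it came from.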

> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect.  ASCII decoding is
> also as I would expect.  The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2.  In that case, we ought to decode three bytes if it was
> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
> always decode 4 bytes.  That implies that the IS_LCPRV1() case in
> pg_mule2wchar_with_len is dead code,

Yes, dead code unless we want to support the following encodings in the
future (from include/mb/pg_wchar.h):
#define LC_SISHENG			0xa0/* Chinese SiSheng characters for
								 * PinYin/ZhuYin (not supported) */
#define LC_IPA				0xa1/* IPA (International Phonetic Association)
								 * (not supported) */
#define LC_VISCII_LOWER		0xa2/* Vietnamese VISCII1.1 lower-case (not
								 * supported) */
#define LC_VISCII_UPPER		0xa3/* Vietnamese VISCII1.1 upper-case (not
								 * supported) */
#define LC_ARABIC_DIGIT		0xa4	/* Arabic digit (not supported) */
#define LC_ARABIC_1_COLUMN	0xa5	/* Arabic 1-column (not supported) */
#define LC_ASCII_RIGHT_TO_LEFT	0xa6	/* ASCII (left half of ISO8859-1) with
										 * right-to-left direction (not
										 * supported) */
#define LC_LAO				0xa7/* Lao characters (ISO10646 0E80..0EDF) (not
								 * supported) */
#define LC_ARABIC_2_COLUMN	0xa8	/* Arabic 2-column (not supported) */

> and that any 4 byte characters
> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
> comment there is driving at, but it's not too clear to me.

Yes, that's because we only support EUC_TW and BIG5, which use
IS_LCPRV2 in the mule internal encoding, as stated in the comment.
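
For illustration, the reverse direction as described in the quoted text might be sketched as below. This is an assumption-laden sketch, not the patch's code: `sketch_wchar_to_mule` and `ASSUMED_PRIVATE_PREFIX` are hypothetical names, and the 0x9d prefix is taken from the "0x9d 0xf? 0x?? 0x??" form mentioned in the quoted text. The three-byte IS_LCPRV1 form is deliberately never produced, mirroring the point that it is unreachable for the currently supported encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lead-byte classes, using the ranges given in the quoted text. */
#define IS_LC1(c) ((c) >= 0x81 && (c) <= 0x8d)
#define IS_LC2(c) ((c) >= 0x90 && (c) <= 0x99)

/* Assumption: every private character gets this prefix back. */
#define ASSUMED_PRIVATE_PREFIX 0x9d

/*
 * Hypothetical helper (not PostgreSQL's code): encode one wide character
 * into the buffer at to (at least 4 bytes) and return the bytes written.
 */
static size_t
sketch_wchar_to_mule(uint32_t w, unsigned char *to)
{
    unsigned char lb = (w >> 16) & 0xff;    /* byte stored with shift 16 */

    assert(to != NULL);

    if (IS_LC1(lb)) {
        to[0] = lb;
        to[1] = w & 0xff;
        return 2;
    }
    if (IS_LC2(lb)) {
        to[0] = lb;
        to[1] = (w >> 8) & 0xff;
        to[2] = w & 0xff;
        return 3;
    }
    if (w < 0x80) {
        /* plain ASCII */
        to[0] = (unsigned char) w;
        return 1;
    }
    /* everything else: four bytes, with the private prefix restored */
    to[0] = ASSUMED_PRIVATE_PREFIX;
    to[1] = lb;
    to[2] = (w >> 8) & 0xff;
    to[3] = w & 0xff;
    return 4;
}
```

Supporting the single-byte private charsets listed above would require choosing the prefix from the charset byte rather than hard-coding it.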
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
