Re: Server-side support of all encodings

From: "Dezso Zoltan" <dezso(dot)zoltan(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-03-28 01:44:00
Message-ID: 7568ba740703271844k69050a61g7e0f6da17e5a4240@mail.gmail.com
Lists: pgsql-hackers

Hello Everyone,

I very much understand why SJIS is not a server encoding. It contains
ASCII second bytes (including \, which can be really nasty inside
normal SQL) and, further, half-width katakana is represented as
one-byte characters, two of which incidentally coincide with a kanji.
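
To make the byte-level problem concrete, here is a minimal sketch that uses Python's built-in shift_jis codec as a stand-in for SJIS. The helper name sjis_bytes is made up for illustration. It shows two common characters whose second byte is 0x5C (the ASCII backslash) and a half-width katakana that takes a single byte.

```python
def sjis_bytes(ch: str) -> bytes:
    """Return the Shift_JIS byte sequence for one character (illustrative helper)."""
    return ch.encode("shift_jis")

# Two-byte characters whose trail byte is 0x5C, the ASCII backslash.
for ch in ("表", "ソ"):
    raw = sjis_bytes(ch)
    print(ch, raw.hex(), "trail byte is backslash:", raw[1] == 0x5C)

# Half-width katakana occupies a single byte with the high bit set.
kana = sjis_bytes("ｱ")
print("ｱ", kana.hex(), "length:", len(kana))
```

A byte-oriented scanner that does not know about SJIS lead bytes would see the 0x5C inside these characters as an escape character.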

My question is, however: what would be the best practice if it were
imperative to use SJIS encoding for texts and none of the built-in
conversions are useful? To elaborate, I need to support Japanese emoji
characters, which are special emoticons for mobile phones. These
characters usually sit in a region that standard SJIS does not specify,
so they are not properly converted to either EUC or UTF8. UTF8 would be
my preferred choice, but unfortunately not all mobile phones support
it, so conversion is still necessary. From what I've seen, the new
SJIS_2004 map seems to define these entities, but I'm not 100% sure
they all get converted properly.

I inherited a system in which this problem is "bypassed" by setting
the server encoding to SQL_ASCII, but that is not the best solution:
full text search is rendered useless, and occasionally the special
character issue rears its ugly head (not only do we have to deal with
normal SQL injection, but also with encoding-based injections). And for
the real WTF, my predecessor converted everything to EUC before
inserting, eventually losing all the emojis and creating all sorts of
strange phenomena, like tables with one column in EUC until a certain
date and SJIS from then on, while all other columns are EUC.
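
As an illustration of what an encoding-based injection looks like, the sketch below assumes a hypothetical, encoding-unaware escaping routine (naive_escape is invented here and is not PostgreSQL's or any driver's actual escaping). It escapes quotes byte by byte, and the inserted backslash is then absorbed as the trail byte of an SJIS character, leaving the quote bare.

```python
def naive_escape(raw: bytes) -> bytes:
    """Hypothetical byte-wise escaper that knows nothing about SJIS."""
    return raw.replace(b"\\", b"\\\\").replace(b"'", b"\\'")

# Attacker input: a lone SJIS lead byte (0x95) followed by a quote.
payload = b"\x95' OR 1=1 --"

escaped = naive_escape(payload)        # 0x95 0x5C 0x27 ...
decoded = escaped.decode("shift_jis")  # 0x95 0x5C is read as one kanji

print(escaped)
print(decoded)
# The quote is no longer preceded by a backslash once SJIS is decoded.
print("quote left unescaped:", "\\'" not in decoded and "'" in decoded)
```

With SQL_ASCII on the server the bytes are stored untouched, so every layer that quotes or parses them has to agree on the encoding for escaping to be reliable.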

Is there a way to properly deal with SJIS plus emoji extensions (a
patch I'm not aware of, for example)? Is it considered a TODO for
future releases, or should I consider augmenting Postgres myself? If
the latter, could you provide any pointers on how to proceed?

Thank you,
Zaki

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
Sent: Monday, March 26, 2007 11:20 AM
To: ITAGAKI Takahiro
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Server-side support of all encodings

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Is there any reason for it?

Very much so --- they aren't safe ASCII-supersets, and thus for example
the parser will fail on them. Backend encodings must have the property
that all bytes of a multibyte character are >= 128.

regards, tom lane

