Re: main log encoding problem

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: exclusion(at)gmail(dot)com
Cc: ishii(at)postgresql(dot)org, pgsql-general(at)postgresql(dot)org, ringerc(at)ringerc(dot)id(dot)au, yi(dot)codeplayer(at)gmail(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: main log encoding problem
Date: 2012-07-19 07:24:26
Message-ID: 20120719.162426.57507866406108945.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-general pgsql-hackers

>> I am thinking about variant of C.
>>
>> Problem with C is, converting from other encoding to UTF-8 is not
>> cheap because it requires huge conversion tables. This may be a
>> serious problem with busy server. Also it is possible some information
>> is lossed while in this conversion. This is because there's no
>> gualntee that there is one-to-one-mapping between UTF-8 and other
>> encodings. Other problem with UTF-8 is, you have to choose *one*
>> locale when using your editor. This may or may not affect handling of
>> string in your editor.
>>
>> My idea is using mule-internal encoding for the log file instead of
>> UTF-8. There are several advantages:
>>
>> 1) Converion to mule-internal encoding is cheap because no conversion
>> table is required. Also no information loss happens in this
>> conversion.
>>
>> 2) Mule-internal encoding can be handled by emacs, one of the most
>> popular editors in the world.
>>
>> 3) No need to worry about locale. Mule-internal encoding has enough
>> information about language.
>> --
>>
> I believe that postgres has such conversion functions anyway. And they
> used for data conversion when we have clients (and databases) with
> different encodings. So if they can be used for data, why not to use
> them for relatively little amount of log messages?

Frontend/Backend encoding conversion only happens when they are
different. While conversion for logs *always* happens. A busy database
could produce tons of logs (i is not unusual that log all SQLs for
auditing purpose).

> And regarding mule internal encoding - reading about Mule
> http://www.emacswiki.org/emacs/UnicodeEncoding I found:
> /In future (probably Emacs 22), Mule will use an internal encoding
> which is a UTF-8 encoding of a superset of Unicode. /
> So I still see UTF-8 as a common denominator for all the encodings.
> I am not aware of any characters absent in Unicode. Can you please
> provide some examples of these that can results in lossy conversion?

You can google by "encoding "EUC_JP" has no equivalent in "UTF8"" or
some such to find such an example. In this case PostgreSQL just throw
an error. For frontend/backend encoding conversion this is fine. But
what should we do for logs? Apparently we cannot throw an error here.

"Unification" is another problem. Some kanji characters of CJK are
"unified" in Unicode. The idea of unification is, if kanji A in China,
B in Japan, C in Korea looks "similar" unify ABC to D. This is a great
space saving:-) The price of this is inablity of
round-trip-conversion. You can convert A, B or C to D, but you cannot
convert D to A/B/C.

BTW, I'm not stick with mule-internal encoding. What we need here is a
"super" encoding which could include any existing encodings without
information loss. For this purpose, I think we can even invent a new
encoding(maybe something like very first prposal of ISO/IEC
10646?). However, using UTF-8 for this purpose seems to be just a
disaster to me.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tatsuo Ishii 2012-07-19 07:49:51 Re: main log encoding problem
Previous Message Alexander Law 2012-07-19 07:23:39 Re: main log encoding problem

Browse pgsql-general by date

  From Date Subject
Next Message Tatsuo Ishii 2012-07-19 07:49:51 Re: main log encoding problem
Previous Message Alexander Law 2012-07-19 07:23:39 Re: main log encoding problem

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2012-07-19 07:49:51 Re: main log encoding problem
Previous Message Alexander Law 2012-07-19 07:23:39 Re: main log encoding problem