Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Sergey Burladyan" <eshkinkot(at)gmail(dot)com>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
Date: 2008-03-19 22:16:34
Message-ID: 47E190C2.80504@enterprisedb.com
Lists: pgsql-bugs

Heikki Linnakangas wrote:
> Sergey Burladyan wrote:
>> src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
>> does not have the Cyrillic letter 'IO' in the ISO-8859-5 to mule internal
>> code translation table (function iso2mic(const unsigned char *l, unsigned
>> char *p, int len)). This is a bug, because the letter is widely used: it is
>> as basic as A, B or C in English :) and it exists in all the Russian
>> Cyrillic encodings (koi8-r, iso-8859-5, windows-1251, cp866).
>> For example, in Russian, the words for 'all', 'hedgehog', 'Christmas tree'
>> and many others must be written with it.
>>
>> Here is a patch that adds it to the ISO-8859-5 to mule internal code
>> translation table. I don't know whether this is OK, or whether it breaks
>> any internal rule or code.
>
> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.

Here's a patch that does the conversion in the other direction as well.
As I'm not too familiar with cyrillic, can you double-check that this
works? I tested it using the convert() function between different
encodings, and it seems ok to me.
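
As an independent cross-check outside the server, the same round trip can be
sketched in Python. This relies on Python's built-in codec tables, not on
PostgreSQL's conversion procedures, so it only shows what a correct
conversion should produce. The encoding names are Python's aliases, and
convert_8bit is a hypothetical helper, not anything in the patch.

```python
# Cross-check sketch: uses Python's codec tables, NOT PostgreSQL's convert().
# Encoding names are Python aliases for ISO-8859-5, KOI8-R, windows-1251, cp866.
CYRILLIC_ENCODINGS = ["iso8859_5", "koi8_r", "cp1251", "cp866"]


def convert_8bit(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes between two 8-bit encodings, pivoting through Unicode."""
    return data.decode(src).encode(dst)


def round_trips(text: str) -> bool:
    """Return True if text survives src -> dst -> src for every encoding pair."""
    for src in CYRILLIC_ENCODINGS:
        original = text.encode(src)
        for dst in CYRILLIC_ENCODINGS:
            converted = convert_8bit(original, src, dst)
            if convert_8bit(converted, dst, src) != original:
                return False
    return True


if __name__ == "__main__":
    # U+0401 and U+0451: the capital and small letter under discussion.
    print("round trip ok:", round_trips("\u0401\u0451"))
    # Byte values of the capital letter in each encoding.
    for enc in CYRILLIC_ENCODINGS:
        print(enc, "\u0401".encode(enc).hex())
```

Comparing these byte values against the output of convert() on a patched
server should show whether both directions of the table are right.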

>> By the way, as far as I understand, you are using the koi8-r encoding as
>> the internal representation of Cyrillic charsets. This causes another
>> problem: the second "widely" used char is <U2116> NUMERO SIGN
>> (many accountants and managers use it :) in the Cyrillic Windows world).
>> It exists in the windows-1251, cp866 and iso-8859-5 encodings, but
>> not in koi8-r...
>
> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.
>
> Are there any other characters like "YO" that are missing, that exist in
> all the encodings? Looking at the character set table for KOI8-R, it
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.
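
The two observations quoted above (the odd placement of "YO" in KOI8-R, and
NUMERO SIGN being absent from KOI8-R) can be checked with a short Python
sketch. It assumes Python's codec tables match the encodings as PostgreSQL
defines them; can_encode and missing_from are hypothetical helper names.

```python
# Sketch based on Python's codec tables (an assumption, not PostgreSQL's own).
ENCODINGS = ["koi8_r", "iso8859_5", "cp1251", "cp866"]


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character exists in the given 8-bit encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def missing_from(ch: str) -> list:
    """List the encodings that cannot represent the character."""
    return [enc for enc in ENCODINGS if not can_encode(ch, enc)]


# The 64 basic Russian letters, U+0410..U+044F, as KOI8-R byte values.
basic_letters = [chr(cp) for cp in range(0x0410, 0x0450)]
koi8_bytes = sorted(ch.encode("koi8_r")[0] for ch in basic_letters)

if __name__ == "__main__":
    print("basic letters occupy", hex(koi8_bytes[0]), "to", hex(koi8_bytes[-1]))
    print("capital YO is at", "\u0401".encode("koi8_r").hex())
    print("small yo is at", "\u0451".encode("koi8_r").hex())
    print("NUMERO SIGN missing from:", missing_from("\u2116"))
```

Any character for which missing_from returns a non-empty list cannot pass
through an intermediate encoding that lacks it, which is the argument for a
Unicode pivot.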

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: cyrillic-2.patch (text/x-diff, 2.8 KB)
