Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding

From: Sergey Burladyan <eshkinkot(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Subject: Re: 8.3 can't convert cyrillic text from 'iso-8859-5' to other cyrillic 8-bit encoding
Date: 2008-03-20 03:33:03
Message-ID: 200803200633.03865.eshkinkot@gmail.com
Lists: pgsql-bugs
Thursday 20 March 2008 01:16:34 Heikki Linnakangas:

Thanks for the answer, Heikki!

> You'd need to modify the mic->ISO-8859-5 translation table as well, for
> converting in the other direction.
Oops, I had not thought about that %)

> Here's a patch that does the conversion in the other direction as well.
> As I'm not too familiar with cyrillic, can you double-check that this
> works? I tested it using the convert() function between different
> encodings, and it seems ok to me.

Yes, I tested it with a function like this, and it works now :)

create or replace function test_convert() returns setof record as $$
declare
  -- Russian alphabet: 33 lowercase and 33 uppercase letters, UTF-8 encoded
  r bytea default 
E'\320\260\320\261\320\262\320\263\320\264\320\265\321\221\320\266\320\267\320\270\320\271\320\272\320\273\320\274\320\275\320\276\320\277\321\200\321\201\321\202\321\203\321\204\321\205\321\206\321\207\321\210\321\211\321\212\321\213\321\214\321\215\321\216\321\217\320\220\320\221\320\222\320\223\320\224\320\225\320\201\320\226\320\227\320\230\320\231\320\232\320\233\320\234\320\235\320\236\320\237\320\240\320\241\320\242\320\243\320\244\320\245\320\246\320\247\320\250\320\251\320\252\320\253\320\254\320\255\320\256\320\257';
  s bytea; -- r converted from UTF-8 to the source encoding
  t bytea; -- s converted to the target encoding and back to UTF-8
  res record;
begin
  raise notice 'russian ABC: "%"', encode(r, 'escape');
  s := convert(r, 'utf-8', 'iso-8859-5');

  t := convert(s, 'iso-8859-5', 'windows-1251');
  t := convert(t, 'windows-1251', 'utf-8');
  if t != r then
     raise exception 'iso-8859-5, windows-1251 | t != r';
  end if;
  res := row(
      'iso-8859-5, windows-1251'::text,
      encode(
          convert(convert(s, 'iso-8859-5', 'windows-1251'), 'windows-1251', 'utf-8'),
          'escape'
      )::text
  );
  return next res;
[...skip...]

seb=# select * from test_convert() as (conv text, res text);
NOTICE:  russian ABC: "абвгдеёжз..."
            conv            |    res
----------------------------+-----------
 iso-8859-5, windows-1251   | абвгдеёжз...
 iso-8859-5, windows-866    | абвгдеёжз...
 iso-8859-5, koi8-r         | абвгдеёжз...
 iso-8859-5, iso-8859-5     | абвгдеёжз...
 windows-866, windows-1251  | абвгдеёжз...
 windows-866, iso-8859-5    | абвгдеёжз...
 windows-866, koi8-r        | абвгдеёжз...
 windows-866, windows-866   | абвгдеёжз...
 windows-1251, windows-866  | абвгдеёжз...
 windows-1251, iso-8859-5   | абвгдеёжз...
 windows-1251, koi8-r       | абвгдеёжз...
 windows-1251, windows-1251 | абвгдеёжз...
 koi8-r, windows-866        | абвгдеёжз...
 koi8-r, iso-8859-5         | абвгдеёжз...
 koi8-r, windows-1251       | абвгдеёжз...
 koi8-r, koi8-r             | абвгдеёжз...
(16 rows)
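
As a rough cross-check outside the database, the same 16 round trips can be sketched in Python. This is only an illustration: it assumes that Python's iso8859_5, cp1251, cp866 and koi8_r codecs correspond to the PostgreSQL encodings of the same names, and it goes through Unicode strings rather than through MULE_INTERNAL, so it says nothing about the patched translation tables themselves. The names ALPHABET, CODECS and round_trip are made up for this example.

```python
from itertools import product

# Russian alphabet: 33 lowercase letters followed by their 33 uppercase forms.
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ALPHABET = LOWER + LOWER.upper()

# Assumed mapping from the encoding names used above to Python codec names.
CODECS = {
    "iso-8859-5": "iso8859_5",
    "windows-1251": "cp1251",
    "windows-866": "cp866",
    "koi8-r": "koi8_r",
}


def round_trip(text, src, dst):
    """Encode text as src, re-encode it as dst, and decode it back to str."""
    src_bytes = text.encode(CODECS[src])
    dst_bytes = src_bytes.decode(CODECS[src]).encode(CODECS[dst])
    return dst_bytes.decode(CODECS[dst])


# One entry per (source, target) pair, True when the alphabet survives intact.
results = {
    (src, dst): round_trip(ALPHABET, src, dst) == ALPHABET
    for src, dst in product(CODECS, repeat=2)
}

for (src, dst), ok in results.items():
    print(f"{src}, {dst}: {'ok' if ok else 'MISMATCH'}")
```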

> Hmm. We use KOI8-R (or rather, MULE_INTERNAL with KOI8-R ) as an
> intermediate encoding, because there's no direct conversion table
> between ISO-8859-5 and the other cyrillic encodings. Ideally there would
> be. Another possibility would be to use UTF-8 as the intermediate
> encoding; that'd probably be much slower, but UTF-8 should have all the
> characters needed.
I think UTF-8 is too complex for translating one 8-bit charset to another 8-bit
charset, but the other solution is many, many translation tables... hard question %)
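
To make the trade-off concrete, here is a minimal Python sketch (not how PostgreSQL does it) that uses Unicode as the intermediate step once, to build a direct 256-byte table, and then converts with a plain table lookup. With four encodings, 4 * 3 = 12 such directed tables would be needed. The helper build_table and the '?' fallback byte are assumptions made for this example.

```python
def build_table(src, dst, fallback=b"?"):
    """Build a 256-byte table mapping src-encoded bytes to dst-encoded bytes.

    The ASCII half is left as identity. Bytes in the upper half that have no
    counterpart in dst are mapped to the fallback byte.
    """
    table = bytearray(range(256))
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(src)       # 8-bit byte -> Unicode character
            table[b] = ch.encode(dst)[0]      # Unicode character -> 8-bit byte
        except (UnicodeDecodeError, UnicodeEncodeError):
            table[b] = fallback[0]
    return bytes(table)


# The pivot through Unicode happens only here, when the table is built.
ISO_TO_KOI8R = build_table("iso8859_5", "koi8_r")

text = "Ёлка ёж"
converted = text.encode("iso8859_5").translate(ISO_TO_KOI8R)
print(converted.decode("koi8_r"))
```

After the table exists, each conversion is one lookup per byte, which is the reason direct tables are attractive despite their number.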

> Is there any other characters like "YO" that are missing, that exist in
> all the encodings? 
If we are talking about alphabet letters, the answer is no: only "YO" was missing.
If we are talking about any character, there is also 'NO-BREAK SPACE' (U+00A0). It exists in
1251, 866, koi8-r and iso, but I do not think it is widely used...
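
One way to double-check this kind of claim is to intersect the upper halves (bytes 0x80 to 0xFF) of the four character sets. The sketch below does that with Python's codecs, on the assumption that they match the PostgreSQL tables; high_half, common and extras are names invented for this example.

```python
CODECS = ["iso8859_5", "cp1251", "cp866", "koi8_r"]

LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
LETTERS = set(LOWER + LOWER.upper())


def high_half(codec):
    """Return the set of characters encoded by bytes 0x80..0xFF in codec."""
    chars = set()
    for b in range(0x80, 0x100):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is unassigned in this codec
    return chars


# Characters that every one of the four encodings can represent.
common = set.intersection(*(high_half(c) for c in CODECS))

# Whatever is shared beyond the 66 alphabet letters.
extras = common - LETTERS
print(sorted(f"U+{ord(ch):04X}" for ch in extras))
```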

> Looking at the character set table for KOI8-R, it 
> looks like the "YO" is in an odd place in the table, compared to all
> other cyrillic characters. Perhaps that's why it was missed.
Yes, I understand. Russian character sets have always been a challenge for all
programmers :) There are at least five of them, and they are all different.
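
For illustration, the odd placement of "YO" can be seen by printing its byte value next to a neighbouring letter in each encoding. This small Python sketch again assumes that Python's codecs match the PostgreSQL encodings; positions is a name made up for the example.

```python
CODECS = ["iso8859_5", "cp1251", "cp866", "koi8_r"]

# Byte value of each character in each encoding, keyed by (codec, character).
positions = {
    (codec, ch): ch.encode(codec)[0]
    for codec in CODECS
    for ch in "ЕЁеё"
}

for codec in CODECS:
    row = "  ".join(f"{ch}=0x{positions[(codec, ch)]:02X}" for ch in "ЕЁеё")
    print(f"{codec:10} {row}")
```

In KOI8-R the main letters occupy 0xC0 to 0xFF, while "YO" sits apart at 0xA3 and 0xB3, among the box-drawing characters.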

Thanks for the patch, Heikki!
