Re: [HACKERS] UNICODE

From: Patrice Hédé <phede-ml(at)islande(dot)org> (by way of Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr>)
To: pgsql-general(at)postgresql(dot)org
Subject: Re: [HACKERS] UNICODE
Date: 2001-10-29 10:03:13
Message-ID: 4.2.0.58.20011029110232.00d0aeb0@pop.freesurf.fr
Lists: pgsql-general

This is a redirection from pgsql-hackers.
*****************************************************************************
Hi Jean-Michel,

* Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr> [011028 18:23]:
>
> >psql uses your input literally - so is your console/xterm in
> >UNICODE/UTF8?
> Client: \encoding returns 'UNICODE'.
> Server: \list show databases. All databases are UNICODE (except
> TEMPLATE0 and TEMPLATE1 which are ASCII of course). I use a Mandrake
> 8.1 distribution and think my console is UNICODE.

I don't know the details of the Mandrake distribution, but I would
expect the default terminal to be iso-8859-15 or iso-8859-1 encoded
(I myself use Debian sid, customised to mix iso-8859-15 and
utf-8 :) ).

In that case, it's likely to cause problems, since psql passes your
input to the server as typed. The first thing to do is to check your
current locale (before running psql) by typing "locale charmap" in
your terminal :

Unicode :

asterix:~$ locale charmap
UTF-8

Latin-9 (fr_FR@euro) :

asterix:~$ locale charmap
ISO-8859-15
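
To illustrate why the charmap matters, here is a small Python sketch
(not part of the original exchange; the variable and function names
are made up for the illustration). It shows that the same text gives
different bytes on a Latin-9 terminal and on a UTF-8 one, and that
the Latin-9 bytes are not valid UTF-8, which is what a UNICODE
database expects when psql forwards the input as typed:

```python
# Illustration only: the same string as a Latin-9 terminal and as a
# UTF-8 terminal would send it to psql.
text = "Permis de conduire accept\u00e9"   # ends with "é" (U+00E9)

latin9_bytes = text.encode("iso-8859-15")  # "é" becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")          # "é" becomes the two bytes 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin9_bytes[-1:], utf8_bytes[-2:])
print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin9_bytes))  # False
```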

Second, if you really do have a Unicode terminal, you may run into
other problems. psql uses readline, and readline is not yet UTF-8
enabled by default. There are patches for that, but I don't know why
the support has not been integrated into the code... Whatever the
reason, it means that, for example, Backspace won't work over
characters that take more than one byte, and that includes everything
which is not ASCII.
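
The Backspace problem can be sketched in Python. This is only a
simulation under the assumption that the line editor deletes one
byte per Backspace; it does not call readline, and the variable
names are invented for the example:

```python
# Simulation only: compare a byte-oriented Backspace with a
# character-oriented one on a line ending in "é".
line = "accept\u00e9"            # "accepté", 7 characters
raw = line.encode("utf-8")       # 8 bytes, since "é" takes two

after_byte_backspace = raw[:-1]                   # removes only 0xA9
after_char_backspace = line[:-1].encode("utf-8")  # removes the whole "é"

print(len(line), len(raw))
print(after_byte_backspace[-1:])   # the orphaned lead byte 0xC3
# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(after_byte_backspace.decode("utf-8", errors="replace")))
print(after_char_backspace)
```

The byte-oriented variant leaves a lone 0xC3 at the end of the
buffer, which is the kind of invisible damage that can end up in the
input sent to psql.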

So if, while typing in psql, you edit the line over the "é", you are
likely to mangle the input sent to psql (without the damage
necessarily being visible in your terminal). The result can be
anything from a bad command line to psql waiting for more input.
When you have finished typing your line, check that the psql prompt
displays an "=" sign, which shows it is ready for a new statement
rather than waiting for the rest of the previous one :

tests=#

Third, the data may differ depending on how it was entered and how
it is queried. For example, if an application converts UTF-8 data to
normalisation form D before submitting it to PostgreSQL, then the
"é" will be stored as "e" + "combining acute accent". When you run
your query, you then have to submit the text in the same form,
because "é" (typed directly from the keyboard) and "e" + "combining
acute accent" are two different things (I plan to add support for
this kind of thing in PostgreSQL 7.3, if I manage to go a bit faster
on my other projects...).
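
The difference between the two forms can be seen with Python's
standard unicodedata module. This is just an illustration of the
normalisation issue, with variable names chosen for the example; it
says nothing about how PostgreSQL itself compares strings:

```python
import unicodedata

# "é" as typed on a keyboard: one precomposed code point, U+00E9.
typed = "accept\u00e9"
# The same word in normalisation form D: "e" followed by U+0301.
stored = unicodedata.normalize("NFD", typed)

print([hex(ord(c)) for c in typed[-1:]])    # ['0xe9']
print([hex(ord(c)) for c in stored[-2:]])   # ['0x65', '0x301']
print(typed == stored)                      # False: a plain comparison fails
# Bringing both sides to the same form makes them comparable again.
print(unicodedata.normalize("NFC", stored) == typed)  # True
```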

Anyway, I tried a query like yours in a UTF-8 xterm, with the
UNICODE encoding set for both psql and the database :

my table :

tests=# insert into matable values ('un texte accentué', 12);
INSERT 70197 1
tests=# insert into matable values ('ça accentue le problème', 14);
INSERT 70198 1

tests=# select * from matable;
         montext         | valeur
-------------------------+--------
 un texte accentué       |     12
 ça accentue le problème |     14
(2 rows)

[note that the "é", "ç" and "è" are not combining forms here...]

tests=# select * from matable where montext ilike '%accentué%';
      montext      | valeur
-------------------+--------
 un texte accentué |     12
(1 row)

It works fine for me.

> >> As for me, I typed INSERT INTO source_content VALUES ('Permis de
> >> conduire accepté') in Psql.
> >As I said - psql does not do any conversion.
> The faulty query is: INSERT INTO test (source_content) VALUES
> ('Permis de conduire accepté');
>
> I just can't believe that Psql is not UTF-8 compatible. It seems
> unreal as Psql is PostgreSQL #1 helper application. Should I use
> PostgreSQL MULE encoding to have automatic trans coding. What are
> the guidelines, I am completely lost.

psql is UTF-8 compatible. However, terminal support for UTF-8 is
still a little shaky (no dead keys, no compose key), which should be
fixed in XFree86 4.2, and readline's support for UTF-8 is deficient
(as is that of bash, where readline comes from). I don't know when
*that* will be fixed. I know http://www.li18nux.org/ has some
patches, but I haven't tried them yet.

> >> Psql does not insert the data and I have to kill it manually. Can
> >> you reproduce this?
> >No. If it hangs this is serious problem. Or did you simply forgot
> >final ';' ? It btw does not seem valid sql to me, considering you
> >previously provided table structure.
> Is it possible that my database is corrupted? I have used pg_dump
> several times to dump data from production server to development
> servers and conversely. Does pg_dump produce UTF8 output? What are
> the guidelines when using UTF-8: forget psql and pg_dump?

One thing you really have to be careful about is the locale your
terminal is running in (see "locale charmap" above). A lot of tools
are sensitive to it as soon as they set the locale, and so is the
terminal itself. If you run an xterm, a gnome-terminal or similar,
make sure it is itself started with the correct locale, rather than
the locale being set by a .bashrc or .profile AFTER the xterm is
launched. One way to be sure is to launch an xterm from the command
line of another xterm ;)

> >In the end: are the strings/queries you give to psql/pg_exec UTF-8
> >- this is now main thing, as you have _configured_ everything
> >correctly.
> Everything is configured correctly server-side (PostgreSQL, Psql).
>
> Thank you very much for your support Marko,
> Best regards,
> Jean-Michel

It's possible to work with psql and UTF-8, I'm doing it myself :)
But UTF-8 support in the tools is not complete yet, and it's not
seamless. Support for UTF-8 in PostgreSQL itself is not complete yet
either (normalisation forms, collation, regexes...), but it'll
come :)

Patrice.

--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
