Character encoding

From: Dennis Björklund <db(at)zigo(dot)dhs(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Character encoding
Date: 2003-06-09 07:20:35
Message-ID: Pine.LNX.4.44.0306090848440.17926-100000@zigo.dhs.org
Lists: pgsql-hackers

I've been playing with character encodings and found a problem/bug. I
still use 7.3.2, so it's possible (though I think not) that some of this
has already been fixed.

When you run psql in a language other than English, the strings are
usually in a character set that is not pure ASCII. For example, to
represent Swedish you need either latin1 or unicode. Therefore the po file
for Swedish is in latin1.

Now, these strings are used to create queries that are sent to postgres.
For example, if you perform \d in a Swedish psql you get

# \d
Lista med relationer
Schema | Namn | Typ | Ägare
--------+-----------+---------+--------
public | boz | tabell | dennis
public | boz_a_seq | sekvens | dennis

where Owner is translated to "Ägare". The problem arises when
the database uses utf-8. psql still seems to create queries with
latin1 characters in them, which are invalid in utf-8. So I get this:

# \d
ERROR: Invalid UNICODE character sequence found (0xe47273)

It has to be translated to utf-8 before it's sent to the backend.

Actually, in the example above it's not the string "Ägare" that gives the
error message but the value that maps relkind 's' to 'särskild' in
Swedish. It seems that column names and column values are treated
differently.
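
To make the byte-level picture concrete, here is a minimal sketch in
Python (not psql's actual code). It assumes the translated string is stored
as latin1, as described above; the helper name is_valid_utf8 is made up for
illustration. It shows that the bytes in the error message come from
'särskild', and that re-encoding the string to utf-8 before sending would
give a valid sequence.

```python
# The Swedish catalog string, encoded the way the latin1 .po file stores it.
latin1_bytes = "särskild".encode("latin-1")

# Bytes 1..3 ("ärs" in latin1) are the sequence quoted in the error message.
offending = latin1_bytes[1:4].hex()


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether data decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Possible fix: convert from the catalog encoding to the database encoding
# before the string is embedded in a query.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

print(offending)                    # e47273
print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

The byte 0xe4 looks like the start of a three-byte utf-8 sequence, but the
following 0x72 and 0x73 are plain ASCII rather than continuation bytes,
which is why the backend rejects it.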

My guess is that the backend doesn't care what the column name is and
just sends it back unchanged, which is broken if there are different
character encodings at play.

I also have another problem with character sets. I have a unicode
database, and when I set the client encoding to unicode I get nice utf-8
strings back. However, my terminal cannot show them, so when I run psql I
get strings like "armbÃ¥ge" (which is what the utf-8 string "armbåge"
looks like in latin1). My client program written using libpq works fine
and I get good utf-8 back.

However, I tried to set the client encoding in psql to latin1 so that it
would show the strings correctly. The string above really should then be
shown as "armbåge", but it is shown as "armbge".

It should work fine since I know that my strings really are latin1 strings
even when represented as utf-8. Also, the manual says that it should work
even for characters that have no conversion: such a character should then
become its hexadecimal value in parentheses.
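
As a rough illustration of the behaviour I would expect, here is a Python
sketch of a utf-8 to latin1 conversion with that fallback. This is not the
backend's conversion code; the function name utf8_to_latin1 is made up, and
printing the hex of the character's utf-8 bytes is my assumption about what
"the hexadecimal value" means.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Convert utf-8 bytes to latin1, one character at a time.

    Characters with no latin1 equivalent are replaced by the hex of their
    utf-8 bytes in parentheses (an assumed reading of the manual).
    """
    out = bytearray()
    for ch in data.decode("utf-8"):
        try:
            out += ch.encode("latin-1")
        except UnicodeEncodeError:
            out += b"(" + ch.encode("utf-8").hex().encode("ascii") + b")"
    return bytes(out)


# "å" exists in latin1, so it becomes the single byte 0xe5.
print(utf8_to_latin1("armbåge".encode("utf-8")))   # b'armb\xe5ge'

# The euro sign does not exist in latin1, so the fallback is used.
print(utf8_to_latin1("5 €".encode("utf-8")))       # b'5 (e282ac)'
```

With a conversion like this the "å" is never dropped, which is the
behaviour I expected from psql with the client encoding set to latin1.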

--
/Dennis
