new FAQ entry (was:Re: UTF8 problem)

From: Tim Allen <tim(at)proximity(dot)com(dot)au>
To: pgsql-general(at)postgresql(dot)org
Subject: new FAQ entry (was:Re: UTF8 problem)
Date: 2006-06-08 07:00:22
Message-ID: 4487CB06.7020904@proximity.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Matthew T. O'Connor wrote:
> Well, to answer my own question, I hacked the source code of DBMail and
> had it set the client encoding to LATIN1 immediately after database
> connect, this seems to have fixed the problem.
>
> Sorry for the noise,
>
> Matt

I've seen this sort of problem asked about in the mailing lists often
enough to think it merits a FAQ entry, so how about this text:

<entry>
Q. Why do I have problems inserting text into my database, with error
messages like

ERROR: invalid byte sequence for encoding "UTF8": 0xe1202c ?

A. Almost certainly that byte sequence really is an invalid byte
sequence for that encoding. The reason you are seeing the error is
probably because you are providing text in some other encoding. You and
the database need to agree between you what encoding you're using.
PostgreSQL is fairly good at working with you, converting to and from
whatever encoding you want to use, but you need to tell it what that
encoding is, and then stick to that encoding consistently.

If you don't set the client encoding, then PostgreSQL will use the
default encoding for the database, which in modern times is often UTF8
(aka UNICODE), and is set at database creation time. However, many
client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so
you need to either educate the client app to use UTF8, or get it to
inform PostgreSQL what other encoding to use.

The way to tell PostgreSQL what encoding you want to use is by use of
the client_encoding GUC variable, eg

set client_encoding to 'LATIN1';

One reason you may be seeing this problem now, after upgrading your
version of PostgreSQL, is that recent versions have tighter validation
of encoded text. Previously you may not have been conscious of what
encoding you were actually using, especially if you're a speaker of a
Western European language, and may have gotten away with writing
incorrectly-encoded text without the database complaining. Now is the
time to start getting it right.

One thing to be wary of is the "SQL_ASCII" encoding. It appears to be
commonly and incorrectly believed that this represents either some
variant on latin1, or pure 7-bit ASCII. It is neither of those, but a
completely unchecked encoding that really means whatever you want it to
mean. This makes it not a very good encoding to use in practice, as it
becomes prone to allowing a mixture of different encodings to be present
in the same set of data, which will cause you headaches when you try to
convert the whole lot to some consistent encoding in the future.

See section 21.2 of the documentation for more complete information.
</entry>

Tim

--
-----------------------------------------------
Tim Allen tim(at)proximity(dot)com(dot)au
Proximity Pty Ltd http://www.proximity.com.au/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tino Wildenhain 2006-06-08 07:10:48 Re: new FAQ entry
Previous Message Tino Wildenhain 2006-06-08 06:37:23 Re: UTF8 problem