Re: text column constraint, newbie question

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: RebeccaJ <rebeccaj(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: text column constraint, newbie question
Date: 2009-03-23 23:54:43
Message-ID: 49C82143.9080500@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

RebeccaJ wrote:

> And I wonder why you like SQL_ASCII better than UTF8, and whether
> others have any opinions about those two. (My web server's LC_CTYPE is
> C, so I can use any character set.) Wouldn't UTF8 allow more
> characters than SQL_ASCII?

I've had a LOT of experience dealing with apps that use 8-bit byte
strings (like SQL_ASCII `text') to store data, and I've rarely seen one
that *doesn't* have text encoding handling bugs.

If you store your text as byte streams that don't know, check, or
enforce their own encoding you must keep track of the encoding
separately - either with another value stored alongside the string, or
through your app logic.

If you start storing data with multiple different text encodings in the
DB, you're most likely to land up tracking down annoying "corrupt text"
bugs sooner or later.

If, on the other hand, you use UTF-8, you *know* that everything in the
database is well-formed UTF-8. You ensure that it is UTF-8 before
storing it in the DB and know it'll be UTF-8 coming out. The DB takes
care of encoding conversion for you if you ask it to, by setting
client_encoding - the only downside being that it'll refuse to return
strings that can't be represented in your current client_encoding, like
say Cyrillic (Russian etc) text if you're using ISO-8859-1 (latin-1) for
your client encoding.

Even with a UTF-8 database you must still get your I/O to/from libraries
and the rest of the system right, converting UTF-8 text to whatever the
system expects or vice versa. Alternately, if you set client_encoding,
you must be prepared for cases where the DB can't send you what you ask
for because your encoding can't represent it.

All in all, I personally think a UTF-8 database is considerably better
for most uses. There are certainly cases where I'd use SQL_ASCII, but
not most.

--
Craig Ringer

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2009-03-24 00:00:15 Re: Last modification time of a database?
Previous Message Alban Hertroys 2009-03-23 23:39:49 Re: text column constraint, newbie question