Unicode vs SQL_ASCII DBs

From: "John Sidney-Woollett" <johnsw(at)wardbrook(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Unicode vs SQL_ASCII DBs
Date: 2004-01-31 15:32:41
Message-ID: 3379.192.168.0.64.1075563161.squirrel@mercury.wardbrook.com
Lists: pgsql-general

Hi

I need to store accented characters in a postgres (7.4) database, and
access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know:

1) Is there a performance penalty for using (multibyte) UNICODE versus a
(single-byte) SQL_ASCII/LATINx encoding, in terms of extra storage and
searching/sorting speed?

2) Can SQL_ASCII be used to store accented characters?

3) If I want accented characters to sort correctly, must I select UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?

4) I'm not initially expecting Arabic, Chinese, Cyrillic or other such
scripts to be stored in the database. But if they were, would UNICODE be
the best encoding scheme to use for future-proofing the data?

5) If I decide not to support/store non-Latin languages, is it better to
use a LATINx encoding instead of UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs on the JDBC list, but I'll ask anyway]

7) Because the database is being used to back a Java web application, are
there other issues that I need to be aware of? For example, do I have to
convert all incoming data to UTF-8 before writing it into the database?
And do I have to ensure that the webserver's response content-type is
always set to UTF-8 for the data to render correctly in a user's browser?
I've sketched below what I'm assuming is needed.
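
For what it's worth, here is the kind of thing I'm assuming I'd need on
the web tier (just a sketch; the filter class name and the charset choices
are mine):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Force UTF-8 on every request/response so that form data is decoded
// correctly and the browser knows how to render the reply.
public class Utf8Filter implements Filter {
    public void init(FilterConfig config) throws ServletException {}

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Must be called before the first getParameter() call,
        // otherwise the container decodes using its default charset.
        request.setCharacterEncoding("UTF-8");
        // Sets the response content-type header to declare UTF-8.
        response.setContentType("text/html; charset=UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {}
}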

Thanks for any help/advice.

John Sidney-Woollett

ps I did some tests between two databases; one created using UNICODE
encoding, and the other using SQL_ASCII encoding. The initdb command
specified no encoding, so I guess that SQL_ASCII is the default encoding.
The results are below:

I created the following table in two databases:

Table "public.table1"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
data | character varying(255) |

Here are the databases; one with UNICODE encoding, and the other with
SQL_ASCII encoding.

         List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
 johntest     | postgres | UNICODE
 johntest2    | postgres | SQL_ASCII
 template0    | postgres | SQL_ASCII
 template1    | postgres | SQL_ASCII

I inserted the data into both databases in psql (first setting the client
encoding to LATIN1). Here is the data (the funny character is the word
"test" with the e replaced by an e-acute, LATIN1 code 233):

set client_encoding to 'LATIN1';

insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');
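
For the JDBC angle, I assume the equivalent insert from Java would go
through a PreparedStatement and the driver would handle the conversion
(a rough sketch; the connection URL and credentials are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertTest {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        // Made-up connection details for the example.
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/johntest", "postgres", "secret");
        PreparedStatement ps =
            conn.prepareStatement("insert into table1 values (?, ?)");
        ps.setInt(1, 1);
        // \u00e9 is e-acute; Java strings are Unicode internally, and
        // the driver converts them to the database encoding on the wire.
        ps.setString(2, "t\u00e9st");
        ps.executeUpdate();
        ps.close();
        conn.close();
    }
}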

Now select the data, ordered by the data column:

(UNICODE DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost

(SQL_ASCII DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tést

NOTE: Bad ordering using SQL_ASCII... the é ends up sorted after all of
the plain ASCII letters.

Then I ran another test: I set the client encoding to UNICODE and
retrieved the data from both databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tst

NOTE: You can see that the e-acute has been "lost"... I assume this is
because the SQL_ASCII database stored the raw LATIN1 byte (0xE9), which
is not a valid UTF-8 sequence, so it cannot be converted when the client
asks for UNICODE output.

(UNICODE DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost
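
And the read-back from Java, which is what the web app will actually be
doing (again just a sketch, with the same made-up connection details as
above), to check the round trip:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SelectTest {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/johntest", "postgres", "secret");
        Statement stmt = conn.createStatement();
        ResultSet rs =
            stmt.executeQuery("select id, data from table1 order by data");
        while (rs.next()) {
            // getString() hands back a normal Unicode Java string,
            // whatever the database encoding is.
            System.out.println(rs.getInt("id") + " | " + rs.getString("data"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}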
