Re: Re: Big 7.1 open items

From: Randall Parker <rgparker(at)west(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: Big 7.1 open items
Date: 2000-06-16 23:18:23
Message-ID: MPG.13b4559da89d333c989813@news.west.net
Lists: pgsql-hackers

Thomas,

A few (hopefully relevant) comments regarding character sets, code pages,
I18N, and all that:

1) I've seen databases (DB2 if memory serves) that allowed the client
side to declare itself to the database back-end engine as being in a
particular code page. For instance, one could have a CP850 Latin-1 client
and an ISO 8859-1 database. The database engine did appropriate
translations in both directions.
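
A minimal sketch of the kind of two-way translation described in 1),
in Python rather than anything DB2 or Postgres actually does. The
function name translate and the sample word are made up for
illustration; the codec names are Python's own.

```python
def translate(data: bytes, source_page: str, target_page: str) -> bytes:
    """Re-encode bytes from one code page to another, going through Unicode."""
    # Decode with the declared source code page, then encode for the target.
    # A character that does not exist in the target raises UnicodeEncodeError.
    return data.decode(source_page).encode(target_page)


# Umlauted a is byte 0x84 in CP850 and byte 0xE4 in ISO 8859-1.
client_bytes = b"M\x84dchen"

# Client -> database, then database -> client.
stored = translate(client_bytes, "cp850", "iso-8859-1")
returned = translate(stored, "iso-8859-1", "cp850")
```

CP850 also has characters (box drawing, for instance) with no
ISO 8859-1 equivalent, so a real engine would need a policy for those;
this sketch simply lets the error propagate.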

2) Mixing code pages in a single column and then having the database
engine support it is not trivial. Each CHAR/VARCHAR would have to have
some code page settable per row (e.g. either as a separate column or as
something like mycolumnname.encoding).
Even if you could handle all that you'd still be faced with the issue
of collating sequence. Each individual code page will have a collating
sequence. But how do you collate across code pages? There'd be letters
that were only in a single code page. Plus, it gets messy with, for
instance, a simple umlauted a that occurs in CP850, CP1252, and ISO
8859-1 (and likely in other code pages as well). That letter is really
the same letter in all those code pages and should be treated as such
when sorting.
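
One way to picture the "same letter in every code page" point in 2) is
to decode each row with its own code page into a common form before
comparing. The sketch below is Python and only an illustration: the
rows, the per-row encoding field and sort_key are hypothetical, and the
ordering is plain code point order, not a real linguistic collation.

```python
import unicodedata

# Each row carries its bytes plus the code page they were written in.
rows = [
    (b"\x84rger", "cp850"),       # umlauted a is 0x84 in CP850
    (b"Zebra", "cp1252"),
    (b"\xe4rger", "cp1252"),      # umlauted a is 0xE4 in CP1252
    (b"apple", "iso-8859-1"),
]


def sort_key(row):
    """Decode with the row's own code page so equal letters compare equal."""
    data, code_page = row
    text = data.decode(code_page)
    # Normalize and fold case; the result is compared by code point only.
    return unicodedata.normalize("NFC", text).casefold()


ordered = sorted(rows, key=sort_key)
```

The two rows holding the same word now get identical keys even though
their bytes differ. Where the umlauted a should sort relative to plain
a or z is still a per-language question this sketch does not answer.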

3) I think it is more important for a database to support lots of
languages in the stored data than in the field names and table names. If
a programmer has to deal with A-Za-z for naming identifiers and that
person is Korean or Japanese then that is certainly an imposition on
them. But it's a far far bigger imposition if that programmer can't build
a database that will store the letters of his national language and sort
and index and search them in convenient ways.

4) The real solution to the multiple code page dilemma is Unicode.
Yes, it's more space. But the can of worms of dealing with multiple
code pages in a column is really no fun and the result is not great.
BTDTHTTS.
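
To put rough numbers on the "more space" remark, here is a small Python
sketch that counts bytes per character in two Unicode encodings. Which
Unicode encoding a database would use is not stated above, so UTF-8 and
UTF-16 are both assumptions made for illustration.

```python
# One ASCII letter, one umlauted a, one Hangul syllable (U+D55C).
samples = ["a", "\u00e4", "\ud55c"]

# Bytes needed per character in each encoding.
sizes = {
    ch: {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le avoids counting a BOM
    }
    for ch in samples
}

# The umlauted a fits in one byte in a single-byte code page.
single_byte = len("\u00e4".encode("iso-8859-1"))
```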

5) The problem with enforcing
I've built a database in DB2 where particular columns in it contained
data from many different code pages (each row had a code page field as
well as a text field). For some applications that is okay if that field
is not going to be part of an index.
However, if a database is going to be defined as being in a particular
code page, and if the database engine is going to reject characters that
are not recognized as part of that code page then you can't play the sort
of game I just described _unless_ there is a different datatype that is
similar to CHAR/VARCHAR but for which the RDBMS does not enforce code
page legality on each character. Otherwise you choose some code page for
a column, you go merrily stuffing in all sorts of rows in all sorts of
code pages, and then along comes some character whose value is not a
valid value for any character in the code page that the RDBMS thinks
the column is in.
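
The difference between an enforcing type and a non-enforcing one can be
sketched in Python as follows. The names is_legal, store_checked and
store_raw are hypothetical and stand in for what an RDBMS would do
internally; the only real behavior relied on is that Python's cp1252
codec leaves byte 0x81 undefined while cp850 maps it to u-umlaut.

```python
def is_legal(data: bytes, code_page: str) -> bool:
    """True if every byte sequence in data is defined in code_page."""
    try:
        data.decode(code_page, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def store_checked(data: bytes, column_code_page: str) -> bytes:
    """Model of a CHAR/VARCHAR column that enforces code page legality."""
    if not is_legal(data, column_code_page):
        raise ValueError(f"bytes not valid in {column_code_page}")
    return data


def store_raw(data: bytes) -> bytes:
    """Model of a CHAR-like type with no legality check at all."""
    return data


# "Mueller" with u-umlaut, written by a CP850 client (u-umlaut is 0x81).
cp850_row = b"M\x81ller"
```

Under these assumptions store_checked(cp850_row, "cp1252") raises,
because 0x81 is not assigned in CP1252, while store_raw(cp850_row)
accepts the same bytes unchanged.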

Anyway, I've done lots of I18N database stuff and hopefully a few of my
comments will be useful to the assembled brethren <g>.

In news:<3948E4D7(dot)A3B722E9(at)alumni(dot)caltech(dot)edu>,
lockhart(at)alumni(dot)caltech(dot)edu says...
> One issue: I can see (or imagine ;) how we can use the Postgres type
> system to manage multiple character sets. But allowing arbitrary
> character sets in, say, table names forces us to cope with allowing a
> mix of character sets in a single column of a system table. afaik this
> general capability is not mandated by SQL9x (the SQL_TEXT character set
> is used for all system resources??). Would it be acceptable to have a
> "default database character set" which is allowed to creep into the
> pg_xxx tables? Even that seems to be a difficult thing to accomplish at
> the moment (we'd need to get some of the text manipulation functions
> from the catalogs, not from hardcoded references as we do now).
>
