Character sets (Re: Re: Big 7.1 open items)

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Thomas Lockhart <lockhart(at)alumni(dot)caltech(dot)edu>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgsql-hackers(at)hub(dot)org
Subject: Character sets (Re: Re: Big 7.1 open items)
Date: 2000-06-20 16:43:44
Message-ID: Pine.LNX.4.21.0006200102490.353-100000@localhost.localdomain
Lists: pgsql-hackers

Thomas Lockhart writes:

> One issue: I can see (or imagine ;) how we can use the Postgres type
> system to manage multiple character sets.

But how are you going to tell a genuine "type" from a character set? And
you might have to have three types for each charset. There'd be a lot of
redundancy and confusion regarding the input and output functions and
other pg_type attributes. No doubt there's something to be learned from
the type system, but character sets have different properties -- like
characters(!), collation rules, encoding "translations" and what not.
There is no doubt also a need for different error handling. So I think that
just dumping every character set into pg_type is not a good idea. That's
almost equivalent to having separate types for char(6), char(7), etc.

Instead, I'd suggest that character sets become separate objects. A
character entity would carry around its character set in its header
somehow. Consider a string concatenation function being invoked with two
arguments of the same exotic character set. Using the type system alone,
you'd have to either provide a function signature for every combination of
character sets or cast both arguments up to SQL_TEXT, concatenate them,
and cast the result back to the original charset. A smarter concatenation
function instead might notice that both arguments are of the same
character set and simply paste them together right there.
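
To make that a bit more concrete, here is a completely made-up C sketch
(none of these structs or function names exist in the backend; it's only
meant to show the shape of the idea) of a string value that carries its
character set around with it, and a concatenation that takes the shortcut
when both sides match:

/*
 * Hypothetical sketch only: a string value tagged with its character
 * set, and a concatenation that avoids the round trip through SQL_TEXT
 * when both inputs already agree.
 */
#include <stdlib.h>
#include <string.h>

typedef struct CsText
{
    int     charset_id;         /* which character set the bytes are in */
    size_t  len;                /* length in bytes */
    char   *data;               /* the encoded bytes themselves */
} CsText;

static CsText *
cstext_concat(const CsText *a, const CsText *b)
{
    CsText *result;

    if (a->charset_id != b->charset_id)
        return NULL;            /* caller would convert via SQL_TEXT first */

    /* Same character set on both sides: just paste the bytes together. */
    result = malloc(sizeof(CsText));
    result->charset_id = a->charset_id;
    result->len = a->len + b->len;
    result->data = malloc(result->len);
    memcpy(result->data, a->data, a->len);
    memcpy(result->data + a->len, b->data, b->len);
    return result;
}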

> But allowing arbitrary character sets in, say, table names forces us
> to cope with allowing a mix of character sets in a single column of a
> system table.

The priority is probably the data people store, not the way they get to
name their tables.

> Would it be acceptable to have a "default database character set"
> which is allowed to creep into the pg_xxx tables?

I think we could go with making all system table char columns Unicode, but
of course they are really of the "name" type, which is another issue
completely.

> We should itemize all of these issues so we can keep track of what is
> necessary, possible, and/or "easy".

Here are a couple of "items" I keep wondering about:

* To what extent would we be able to use the operating system's locale
facilities? Besides the fact that some systems are deficient or broken one
way or another, POSIX really doesn't provide much besides "given two
strings, which one is greater", and then only on a per-process basis.
We'd really need more than that; see also the LIKE indexing issues, and
indexing in general.
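
To illustrate the point, this is about all POSIX gives us; a minimal
sketch (the locale name is just an example and is system-dependent):

/*
 * setlocale() switches collation for the entire process, and strcoll()
 * only answers "which of these two strings sorts first" under that one
 * locale -- nothing per-column, per-table, or per-call.
 */
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* Process-global: every strcoll() from here on uses this locale. */
    if (setlocale(LC_COLLATE, "sv_SE") == NULL)
        setlocale(LC_COLLATE, "");      /* fall back to the environment */

    /* All we get back is an ordering, relative to whatever LC_COLLATE is. */
    printf("%d\n", strcoll("cote", "coup"));
    return 0;
}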

* Client support: A lot of language environments provide pretty smooth
Unicode support these days, e.g., Java, Perl 5.6, and I think that C99 has
also made some strides. So while "we can store stuff in any character set
you want" is great, it's really no good if it doesn't work transparently
with the client interfaces. At least something to keep in mind.
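
For what it's worth, here is roughly what the client-side step looks like
with libpq (a rough sketch, error handling trimmed): the backend converts
everything to one agreed-upon encoding, and the client is then on its own
to map the raw bytes into its language's native strings.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("dbname=test");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Ask the backend to hand everything back as Unicode (UTF-8). */
    PQsetClientEncoding(conn, "UNICODE");

    res = PQexec(conn, "SELECT relname FROM pg_class LIMIT 5");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        int i;

        for (i = 0; i < PQntuples(res); i++)
            printf("%s\n", PQgetvalue(res, i, 0));  /* raw UTF-8 bytes */
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}

That last printf is exactly where the transparency ends: what arrives is a
byte string, and turning it into a Java String, a Perl utf8 scalar, or C99
wide characters is left entirely to the caller.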

--
Peter Eisentraut          Sernanders väg 10:115
peter_e(at)gmx(dot)net    75262 Uppsala
http://yi.org/peter-e/    Sweden
