Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem

From: Thomas Lockhart <lockhart(at)alumni(dot)caltech(dot)edu>
To: phd2(at)earthling(dot)net
Cc: "'Oleg Bartunov'" <oleg(at)sai(dot)msu(dot)su>, "'hackers(at)postgresql(dot)org'" <hackers(at)postgreSQL(dot)org>, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
Date: 1999-06-11 15:23:57
Message-ID: 37612A0D.BAF8B997@alumni.caltech.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > istm that the Russian and Japanese contingents could represent the
> > needs of multibyte and locale concerns very well. So, we should ask
> > ourselves some questions to see if we can make *progress* in evolving
> > our text handling, rather than just staying the same forever.
> Ok, we are here.
> And what a pros and cons for NCHAR?

I was hoping you would tell me! :)

> > SQL92 suggests some specific text handling features to help with
> > non-ascii applications.
> What the help?

OK, SQL92 defines two kinds of native character sets: those we already
have (char, varchar) and those which can be locale customized (nchar,
national character varying, and others). char and varchar always
default to the "SQL" behavior (which I think corresponds to ascii
(called "SQL_TEXT") but I didn't bother looking for the details).

So, at its simplest, there would be two sets of character types, with
char, varchar, etc, always the same on every system (just like
Postgres w/o multibyte or locale), and nchar, nvarchar, etc configured
as your locale/multibyte environment would want.

However, there are many more features in SQL92 to deal with text
customization. I'll mention a few (well, most of them, but not in
detail):

o You can define a "character set" and, independently, a "collation".
The syntax for the type definition is
CHARACTER [ VARYING ] [ (length) ]
[ CHARACTER SET your-character-set ]
[ COLLATE your-collation-sequence ]

o You can specify a character type for string literals:
_your-character-set 'literal string' e.g. _ESPANOL 'Que pasa?'
(forgive my omission of a leading upside down question mark :)
We already have some support for this in that character string
literals can have a type specification (e.g. "DATETIME 'now'") and
presumably we can use the required underscore to convert the
"_ESPANOL" to a character set and collation, all within the existing
Postgres type system.

o You can specify collation behavior in a strange way:
'Yo dude!' COLLATE collation-method
(which we could convert in the parser to a function call).

o You can translate between character sets, *if* there is a reasonable
mapping available:
TRANSLATE(string USING method)
and you can define translations in a vague way (since no one but
Postgres implemented a type system back then):
CREATE TRANSLATION translation
FOR source-charset
TO target-charset
FROM { EXTERNAL('external-translation') | IDENTITY |
existing-translation }
DROP TRANSLATION translation

o You can convert character strings which have the same character
"repertoire" from one to the other:
CONVERT(string USING conversion-method)
(e.g. we could define a method "EBCDIC_TO_ASCII" once we have an
"EBCDIC" character set).

o You can specify identifiers (column names, etc) with a specific
character set/collation by:
_charset colname (e.g. _FRANCAIS Francais where the second "c" is
allowed to be "c-cedilla", a character in the French/latin character
set; sorry I didn't type it).

> > Would these mechanisms work for people? Or are they so fundamentally
> > flawed or non-standard (it is from a standard, but I'm not sure who
> > implements it)?

Fully implementing these features (or a reasonable subset) would give
us more capabilities than we have now, and imho can be fit into our
existing type system. *Only* implementing NCHAR etc gives us the
ability to carry SQL_TEXT and multibyte/locale types in the same
database, which may not be a huge benefit to those who never want to
mix them in the same installation. I don't know who those folks might
be but Tatsuo and yourself probably do.

Comments?

- Thomas

--
Thomas Lockhart lockhart(at)alumni(dot)caltech(dot)edu
South Pasadena, California

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Lockhart 1999-06-11 15:38:38 "DML"
Previous Message Tatsuo Ishii 1999-06-11 15:14:55 Re: [HACKERS] Re: locales and MB (was: Postgres 6.5 beta2 and beta3 problem)