Re: [PATCHES] Postgres-6.3.2 locale patch

From: "Thomas G(dot) Lockhart" <lockhart(at)alumni(dot)caltech(dot)edu>
To: "Jose' Soares Da Silva" <sferac(at)bo(dot)nettuno(dot)it>, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, phd2(at)earthling(dot)net
Cc: Postgres Hackers List <hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Postgres-6.3.2 locale patch
Date: 1998-06-04 15:07:11
Message-ID: 3576B81F.D222AD6A@alumni.caltech.edu
Lists: pgsql-hackers

> > Sounds like an interesting idea... But before going into discussion, let me
> > clarify what "character sets" means.
> > An "encoding" is a way to represent a set of character sets in
> > computers.
> > I think SQL92 uses the term "character set" to mean an encoding.

I have found the SQL92 terminology confusing, because it does not seem
to make the nice clear distinction between encoding and collation
sequence which you have pointed out. I suppose the visual appearance of
an alphabet can also differ between locales.

AFAIK, SQL92 uses the term "character set" to mean an encoding with an
implicit collation sequence. SQL92 allows alternate collation sequences
to be specified for a "character set" where that is meaningful.

I would propose to implement
    VARCHAR(length) WITH CHARACTER SET setname

as a type with a type name of, for example, "VARSETNAME". This type
would have the comparison functions and operators which implement
collation sequences.
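
To make "comparison functions which implement collation sequences"
concrete, here is a minimal sketch in Python rather than backend C. The
weight table, the function names, and the rule for characters missing
from the table are all assumptions for illustration; none of this is
existing Postgres code.

```python
from functools import cmp_to_key

# Hypothetical collation table: each character maps to a sort weight.
# Here lower case sorts just before its upper-case partner.
EXAMPLE_WEIGHTS = {ch: i for i, ch in enumerate("aAbBcC")}


def collate_cmp(left, right, weights=EXAMPLE_WEIGHTS):
    """Three-way comparison (-1, 0, 1) of two strings under a weight table."""

    def sort_key(text):
        # Assumption: characters not in the table sort after all known
        # characters, ordered by code point.
        return [(0, weights[c]) if c in weights else (1, ord(c)) for c in text]

    lk, rk = sort_key(left), sort_key(right)
    return (lk > rk) - (lk < rk)


def collate_sort(values, weights=EXAMPLE_WEIGHTS):
    """Sort strings using collate_cmp as the ordering operator."""
    return sorted(values, key=cmp_to_key(lambda a, b: collate_cmp(a, b, weights)))
```

A per-type comparison function of this shape is what the "<", "=", and
">" operators of such a type would call; only the weight table changes
from one collation to the next.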

I would propose to implement
    VARCHAR(length) WITH CHARACTER SET setname COLLATION collname

as a type with a name of, for example, "VARCOLLNAME". For the EUC_JP
encoding, "collname" could be "Korean" or "Japanese" so the type name
would become "varkorean" or "varjapanese". I don't know for sure yet
whether this is adequate, but other naming schemes can be used if
necessary.
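
The naming rule can be sketched as a small function. This is only a
guess at the rule, inferred from the "varkorean"/"varjapanese" examples:
the function name, the precedence of the collation name over the
character set name, and the handling of punctuation are assumptions.

```python
import re


def derived_type_name(setname, collname=None):
    """Build a type name such as "varjapanese" from a set/collation name.

    Assumed rule: use the collation name when one is given, otherwise the
    character set name; fold to lower case and replace anything that is
    not a letter or digit with an underscore.
    """
    base = collname if collname is not None else setname
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", base).strip("_").lower()
    if not cleaned:
        raise ValueError("empty character set or collation name")
    return "var" + cleaned
```

With this rule, ("EUC_JP", "Japanese") gives "varjapanese" and a bare
"EUC_JP" gives "vareuc_jp".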

When a database is created, a default character set/collation sequence
can be specified for it; this would correspond to the
NCHAR/NVARCHAR/NTEXT types. We could implement a
    SET NATIONAL CHARACTER SET = 'language';

command to determine the default character set for the session when
NCHAR is used.

The SQL92 technique for specifying an encoding/collation sequence in a
literal string is
    _language 'string'

so for example to specify a string in the French language (implying an
encoding, collation, and representation?) you would use
    _FRENCH 'string'
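
As a rough illustration of what the parser would have to recognize, the
sketch below splits such a literal into its introducer and its string
body. It is written in Python rather than as lexer rules, and the
function name, the upper-casing of the introducer, and the doubled-quote
handling are assumptions for illustration only.

```python
import re

# An underscore, an identifier, optional whitespace, then a quoted string
# in which a doubled quote ('') stands for one literal quote.
_INTRODUCED = re.compile(r"^_([A-Za-z][A-Za-z0-9_]*)\s*'((?:[^']|'')*)'$")


def parse_introduced_literal(text):
    """Return (NAME, body) for input like _FRENCH 'string', else None."""
    match = _INTRODUCED.match(text.strip())
    if match is None:
        return None
    name = match.group(1).upper()
    body = match.group(2).replace("''", "'")
    return name, body
```

For example, parse_introduced_literal("_FRENCH 'string'") returns
("FRENCH", "string"), while a plain quoted string with no introducer
returns None and would fall back to the default character set.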

> > I would be able to help you in the Japanese part. For Chinese and
> > Korean, I'm going to find volunteers in the local PostgreSQL mailing
> > list I'm running if necessary.
>
> I may help with Italian, Spanish and Portuguese.

Great, and perhaps Oleg could help test with Cyrillic (I assume I can
steal code from the existing "CYR_LOCALE" blocks in the Postgres
backend).

> > Collation sequences for EUC_JP? How nice it would be! One problem
> > with collation sequences for multi-byte encodings is that the
> > sequence might become huge. It seems you have a solution for that.
> > Please let me know more details.

Um, no, I just assume we can find a solution :/ I'd like to implement
the infrastructure in the Postgres parser to allow multiple
encodings/collations, and then see where we are. As I mentioned, this
would be done for v6.4 as a transparent add-on, so that existing
capabilities are not touched or damaged. Implementing everything for
some European languages (with the 1-byte Latin-1 encoding?) may be
easiest, but the Asian languages might be more fun :)

- Tom
