Re: Thoughts on multiple simultaneous code page support

From: "Randall Parker" <randall(at)nls(dot)net>
To: "Giles Lean" <giles(at)nemeton(dot)com(dot)au>
Cc: "PostgreSQL-Dev" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on multiple simultaneous code page support
Date: 2000-06-22 01:52:34
Message-ID: 01501518836812@mail.nls.net
Lists: pgsql-hackers

On Thu, 22 Jun 2000 11:17:14 +1000, Giles Lean wrote:

>
>> 1) Make the entire database Unicode
>> ...
>> It also makes sorting and indexing take more time.
>
>Mentioned in my other email, but what collation order were you
>proposing to use? Binary might be OK for unique keys but that doesn't
>help you for '<', '>' etc.

Using Unicode on a field that can have indexes defined on it does require a single big
collation order table that determines the relative order of all the characters in Unicode. Surely
there must be a standard for this that is part of the Unicode spec? Or part of the ISO/IEC 10646
spec?
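
To make the binary-versus-collation point concrete, here is a minimal sketch in Python. The key function `toy_key` is a made-up stand-in for a real collation table, not anything from a standard; it only shows that code point order and a dictionary-style order can disagree on plain ASCII.

```python
# Binary (code point) order versus a toy collation order.
words = ["apple", "Zebra", "banana"]

# Plain sorted() compares code points, so uppercase 'Z' (90)
# sorts before lowercase 'a' (97).
binary_order = sorted(words)

def toy_key(word):
    # Hypothetical collation key: compare case-insensitively first,
    # then fall back to the original string to break ties.
    return (word.casefold(), word)

dictionary_order = sorted(words, key=toy_key)

print(binary_order)
print(dictionary_order)
```

A binary index is fine for equality lookups on unique keys, but range comparisons on it would follow the first ordering, not the second.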

One optimization possible here would be to allow the user to declare to the RDBMS what
subset of Unicode he is going to use. So, for instance, someone who is only handling
European languages might just say he wants to use ISO 8859-1 through 8859-9. Or a Japanese
company might throw in some more code pages but still not bring in code pages for
languages for which they do not create manuals.

That would make the collation table _much_ smaller.
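
A rough sketch of the size argument, in Python. It assumes the declared subset is exactly ISO 8859-1 through 8859-9 and uses Python's built-in codecs to enumerate the characters those code pages cover. The `rank` table is only a placeholder (it ranks by code point), standing in for real collation weights.

```python
import sys

def declared_subset(codec_names):
    """Collect every character that the declared code pages can encode."""
    chars = set()
    for name in codec_names:
        for byte in range(256):
            try:
                chars.add(bytes([byte]).decode(name))
            except UnicodeDecodeError:
                # Some code pages leave a few byte values undefined.
                pass
    return chars

# Assumption: the user declared ISO 8859-1 through 8859-9.
codec_names = [f"iso8859_{n}" for n in range(1, 10)]
subset = declared_subset(codec_names)

# Placeholder collation table covering only the declared subset.
rank = {ch: i for i, ch in enumerate(sorted(subset))}

full_size = sys.maxunicode + 1  # size of the whole code point space
print(len(rank), "entries instead of up to", full_size)
```

The table needs one entry per character in the declared subset, so nine 8-bit code pages cap it at roughly a thousand entries.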

I don't know anything about the collation order of Asian character sets. My guess though is
that each in toto is either greater or lesser than the various Euro pages. Though the non-
shifted part of Shift-JIS would be equal to its ASCII equivalents.

>My expectation (not the same as I'd like to see, necessarily, and not
>that my opinion counts -- I'm not a developer) would be that each
>database have a locale, and that this locale's collation order be used
>for indexing, LIKE, '<', '>' etc.

Characters like '<' and '>' already have standard collation orders vis a vis the other parts of
ASCII. I doubt these things vary by locale. But maybe I'm wrong.

>If you want to store data from
>multiple human languages using a locale that has Unicode for its
>character set would be appropriate/necessary.

So you are saying that the same characters can have a different collation order when they
appear in different locales even if they have the same encoding in all of them?

If so, then Unicode is really not a locale. It's an encoding but it is not a locale.
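
As an illustration of that question, here is a small Python sketch with two hand-written toy collations. It relies on two conventions: the Swedish alphabet places å, ä, ö after z, while German dictionary ordering treats ä like a. The key functions `german_key` and `swedish_key` are hypothetical simplifications, not real locale data, and they handle only lowercase letters from the alphabets shown.

```python
# The same encoded strings, sorted under two toy locale collations.
words = ["zon", "äpple", "apa"]

# Toy German rule: fold umlauts onto their base letters before comparing.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def german_key(word):
    # Compare the folded form first, the original only as a tie-breaker.
    return (word.lower().translate(GERMAN_FOLD), word)

# Toy Swedish rule: å, ä, ö are separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    return [SWEDISH_RANK[ch] for ch in word.lower()]

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)   # 'äpple' lands next to 'apa'
print(swedish_order)  # 'äpple' lands after 'zon'
```

The strings and their encoding are identical in both runs; only the collation rule changes, which is why the encoding and the locale have to be treated as separate things.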

>Regards,
>
>Giles
>
