Unicode combining characters

From: Patrice Hédé <phede-ml(at)islande(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Unicode combining characters
Date: 2001-09-24 21:52:04
Message-ID: 20010924235204.L1316@idf.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

while working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :

lang | length | length | text | text
------+--------+--------+-----------+-----------
isl | 7 | 6 | álíta | áleit
isl | 7 | 7 | álíta | álitum
isl | 7 | 7 | álíta | álitið
isl | 5 | 4 | maður | mann
isl | 5 | 7 | maður | mönnum
isl | 5 | 5 | maður | manna
isl | 5 | 4 | óska | -aði

[the misalignment is what I got, it's not a copy-paste error]

This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
querying a database created with -E UNICODE (by the way, these are
icelandic words :) ).

What you see above is misleading, since it's not possible to see that
'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not.

As a reminder, a combining mark in Unicode is that á is actually
encoded as a + ' (where ' is the acute combining mark).

Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301],
instead of <c3 a1> [UTF16: 00E1].

The "length" fields are what is returned by length(a.text) and
length(b.text).

So, this shows two problems :

- length() on the server side doesn't handle correctly Unicode [I have
the same result with char_length()], and returns the number of chars
(as it is however advertised to do), rather the length of the
string.

- the psql frontend makes the same mistake.

I am using version 7.1.3 (debian sid), so it may have been corrected
in the meantime (in this case, I apologise, but I have only recently
started again to use PostgreSQL and I haven't followed -hackers long
enough).

=> I think fixing psql shouldn't be too complicated, as the glibc
should be providing the locale, and return the right values (is this
the case ? and what happens for combined latin + chinese characters
for example ? I'll have to try that later). If it's not fixed already,
do you want me to look at this ? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from thursday].

=> regarding the backend, it may be more complex, as the underlaying
system may not provide any UTF-8 locale to use (!= from being UTF-8
aware : an administrator may have decided that UTF-8 locales are
useless on a server, as only root connections are made, and he wants
only the C locale on the console - I've seen that quite often ;) ).

This brings me to another subject : I will need to support the full
Unicode collation algorithm (UCA, as described in TR#10 [1] of the
Unicode consortium), and I will need to be able to sort according to
locales which may not be installed on the backend server (some of
which may not even be recognised by GNU libc, which supports already
more than 140 locales -- artificial languages would be an example). I
will also need to be able to normalise the unicode strings (TR#15 [2])
so that I don't have some characters in legacy codepoints [as 00E1
above], and others with combining marks.

There is today an implementation in perl of the needed functionality,
in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
:( ). But as they are Perl modules, the untrusted version of perl,
plperlu, will be needed, and it's a pity for what I consider a core
functionality in the future (not that plperlu isn't a good thing - I
can't wait for it ! - but that an untrusted pl language is needed to
support normalisation and collation).

Note also that there are a lot of data associated with these
algorithms, as you could expect.

I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)

Anyway, I'm open to suggestions :

- implement it in C, in the core,

- implement it in C, as contributed custom functions,

- implement it in perl (by reusing Unicode:: work), in a trusted plperl,

- implement it in perl, calling Unicode:: modules, in an untrusted
plperl.

and then :

- provide the data in tables (system and/or user) - which should be
available across databases,

- load the data from the original text files provided in Unicode (and
other as needed), if the functionality is compiled into the server.

- I believe the basic unicode information should be standard, and the
locales should be provided as contrib/ files to be plugged in as
needed.

I can't really accept a solution which would rely on the underlaying
libc, as it may not provide the necessary locales (or maybe, then,
have a way to override the collating tables by user tables - actually,
this would be certainly the best solution if it's in the core, as the
tables will put an extra burden on the distribution and the
installation footprint, especially if the tables are already there,
for glibc, for perl5.6+, for other software dealing with Unicode).

The main functions I foresee are :

- provide a normalisation function to all 4 forms,

- provide a collation_key(text, language) function, as the calculation
of the key may be expensive, some may want to index on the result (I
would :) ),

- provide a collation algorithm, using the two previous facilities,
which can do primary to tertiary collation (cf TR#10 for a detailed
explanation).

I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...

Comments ?

Patrice.

[1] http://www.unicode.org/unicode/reports/tr10/

[2] http://www.unicode.org/unicode/reports/tr15/

--
Patrice HÉDÉ ------------------------------- patrice à islande.org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
-- What would _you_ call the creation of the universe ?
-- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gowey, Geoffrey 2001-09-24 22:41:57 ODBC driver flakieness
Previous Message Colin 't Hart 2001-09-24 20:51:14 Re: [HACKERS] not on .hackers