Re: Unicode combining characters

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Patrice Hédé <phede-ml(at)islande(dot)org>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode combining characters
Date: 2001-09-25 09:50:51
Message-ID: Pine.GSO.4.33.0109251248510.18015-100000@ra.sai.msu.su
Lists: pgsql-hackers

Looks like a good project for 7.3
Probably the best starting point would be to develop contrib/unicode
with a smooth transition to core.

Oleg
On Mon, 24 Sep 2001, Patrice Hédé wrote:

> Hi all,
>
> while working on a new project involving PostgreSQL and making some
> tests, I have come up with the following output from psql :
>
> lang | length | length | text | text
> ------+--------+--------+-----------+-----------
> isl | 7 | 6 | álíta | áleit
> isl | 7 | 7 | álíta | álitum
> isl | 7 | 7 | álíta | álitið
> isl | 5 | 4 | maður | mann
> isl | 5 | 7 | maður | mönnum
> isl | 5 | 5 | maður | manna
> isl | 5 | 4 | óska | -aði
>
> [the misalignment is what I got, it's not a copy-paste error]
>
> This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
> querying a database created with -E UNICODE (by the way, these are
> icelandic words :) ).
>
> What you see above is misleading, since it's not possible to see that
> 'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not.
>
> As a reminder, a combining mark in Unicode means that á can actually be
> encoded as a + ' (where ' stands for the combining acute accent).
>
> Encoded in UTF-8, it's then <61 cc 81> [UTF16: 0061 0301],
> instead of <c3 a1> [UTF16: 00E1].
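The two byte sequences above are easy to check — here is a quick Python illustration (Python is just a convenient stand-in for a C test program):

```python
import unicodedata

# "a" followed by U+0301 COMBINING ACUTE ACCENT, vs. precomposed U+00E1
decomposed = "a\u0301"
precomposed = "\u00e1"

print(decomposed.encode("utf-8"))    # <61 cc 81>
print(precomposed.encode("utf-8"))   # <c3 a1>

# The two strings render identically but compare unequal
# until normalised to the composed form (NFC):
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```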
>
> The "length" fields are what is returned by length(a.text) and
> length(b.text).
>
> So, this shows two problems :
>
> - length() on the server side doesn't handle Unicode correctly [I get
> the same result with char_length()]: it returns the number of code
> points (as it is, however, advertised to do), rather than the number
> of characters in the string.
>
> - the psql frontend makes the same mistake.
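The 7-vs-5 discrepancy in the table is exactly this code-point count; a minimal sketch (Python for illustration only):

```python
import unicodedata

# "álíta" written with combining marks: 5 visible characters, 7 code points
word = "a\u0301li\u0301ta"
print(len(word))                                # 7 - code points, as length() counts
print(len(unicodedata.normalize("NFC", word)))  # 5 - after composing the accents
```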
>
> I am using version 7.1.3 (debian sid), so it may have been corrected
> in the meantime (in this case, I apologise, but I have only recently
> started again to use PostgreSQL and I haven't followed -hackers long
> enough).
>
>
> => I think fixing psql shouldn't be too complicated, as glibc should
> provide the locale and return the right values (is this the case? And
> what happens for combined Latin + Chinese characters, for example?
> I'll have to try that later). If it's not fixed already,
> do you want me to look at this ? [it will take some time, as I haven't
> set up any development environment for postgres yet, and I'm away for
> one week from thursday].
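For the display side, the fix amounts to counting terminal columns rather than code points — roughly what wcswidth() does in the C library. A rough sketch (Python for illustration; the width rules here are simplified assumptions, not psql's actual logic):

```python
import unicodedata

def display_width(s: str) -> int:
    """Crude terminal column width: combining marks take 0 columns,
    East Asian wide/fullwidth characters take 2, everything else 1.
    (A simplified stand-in for the C library's wcswidth().)"""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

# Both spellings of "álíta" occupy 5 columns, though one is 7 code points:
print(display_width("a\u0301li\u0301ta"))  # 5
print(display_width("\u00e1l\u00edta"))    # 5
```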
>
> => regarding the backend, it may be more complex, as the underlying
> system may not provide any UTF-8 locale to use (!= from being UTF-8
> aware : an administrator may have decided that UTF-8 locales are
> useless on a server, as only root connections are made, and he wants
> only the C locale on the console - I've seen that quite often ;) ).
>
>
> This brings me to another subject : I will need to support the full
> Unicode collation algorithm (UCA, as described in TR#10 [1] of the
> Unicode consortium), and I will need to be able to sort according to
> locales which may not be installed on the backend server (some of
> which may not even be recognised by GNU libc, which already supports
> more than 140 locales -- artificial languages would be an example). I
> will also need to be able to normalise the unicode strings (TR#15 [2])
> so that I don't have some characters in legacy codepoints [as 00E1
> above], and others with combining marks.
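For reference, the four normalisation forms of TR#15 applied to a mixed string look like this (Python's unicodedata used for illustration):

```python
import unicodedata

s = "\u00e1li\u0301ta"  # "álíta" mixing a precomposed á with a combining í
forms = {f: unicodedata.normalize(f, s) for f in ("NFC", "NFD", "NFKC", "NFKD")}
for name, out in forms.items():
    # NFC/NFKC compose to 5 code points; NFD/NFKD decompose to 7
    print(name, len(out), [f"U+{ord(c):04X}" for c in out])
```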
>
> There is today an implementation in perl of the needed functionality,
> in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
> :( ). But as they are Perl modules, the untrusted version of perl,
> plperlu, will be needed, and it's a pity that what I consider core
> functionality for the future requires it (not that plperlu isn't a good
> thing - I can't wait for it ! - but an untrusted pl language shouldn't
> be needed to support normalisation and collation).
>
> Note also that there are a lot of data associated with these
> algorithms, as you could expect.
>
> I was wondering if some people have already thought about this, or
> already done something, or if some of you are interested in this. If
> nobody does anything, I'll do something eventually, probably before
> Christmas (I don't have much time for this, and I don't need the
> functionality right now), but if there is an interest, I could team
> with others and develop it faster :)
>
> Anyway, I'm open to suggestions :
>
> - implement it in C, in the core,
>
> - implement it in C, as contributed custom functions,
>
> - implement it in perl (by reusing Unicode:: work), in a trusted plperl,
>
> - implement it in perl, calling Unicode:: modules, in an untrusted
> plperl.
>
> and then :
>
> - provide the data in tables (system and/or user) - which should be
> available across databases,
>
> - load the data from the original text files provided in Unicode (and
> other as needed), if the functionality is compiled into the server.
>
> - I believe the basic unicode information should be standard, and the
> locales should be provided as contrib/ files to be plugged in as
> needed.
>
> I can't really accept a solution which would rely on the underlying
> libc, as it may not provide the necessary locales (or maybe, then,
> have a way to override the collating tables by user tables - actually,
> this would be certainly the best solution if it's in the core, as the
> tables will put an extra burden on the distribution and the
> installation footprint, especially if the tables are already there,
> for glibc, for perl5.6+, for other software dealing with Unicode).
>
> The main functions I foresee are :
>
> - provide a normalisation function to all 4 forms,
>
> - provide a collation_key(text, language) function, as the calculation
> of the key may be expensive, some may want to index on the result (I
> would :) ),
>
> - provide a collation algorithm, using the two previous facilities,
> which can do primary to tertiary collation (cf TR#10 for a detailed
> explanation).
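To make the collation_key() idea concrete, here is a toy multi-level key in the spirit of TR#10 (Python for illustration; the level rules are simplified assumptions and the real DUCET weight tables are ignored entirely):

```python
import unicodedata

def collation_key(text: str) -> tuple:
    """Toy sketch of a multi-level sort key, in the spirit of TR#10.

    Level 1: base letters with accents stripped, case-folded.
    Level 2: the combining marks (so accents only break primary ties).
    Level 3: the original string (case and spelling as tie-breaker).
    """
    nfd = unicodedata.normalize("NFD", text)
    primary = "".join(c for c in nfd if not unicodedata.combining(c)).lower()
    secondary = tuple(unicodedata.combining(c) for c in nfd)
    tertiary = text
    return (primary, secondary, tertiary)

# "a" and "á" share a primary weight, so they sort together at level 1:
print(collation_key("a")[0] == collation_key("\u00e1")[0])  # True

words = ["ma\u00f0ur", "mann", "m\u00f6nnum", "manna"]
print(sorted(words, key=collation_key))
```

An index on such a key is what makes expensive collations bearable — compute once, compare cheaply.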
>
> I haven't looked at PostgreSQL code yet (shame !), so I may be
> completely off-track, in which case I'll retract myself and won't
> bother you again (on that subject, that is ;) )...
>
> Comments ?
>
>
> Patrice.
>
> [1] http://www.unicode.org/unicode/reports/tr10/
>
> [2] http://www.unicode.org/unicode/reports/tr15/
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
