Re: Unicode upper() bug still present

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)tm(dot)ee>, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode upper() bug still present
Date: 2003-10-20 20:58:00
Message-ID: Pine.LNX.4.44.0310202235580.29086-100000@peter.localdomain
Lists: pgsql-hackers

Tom Lane writes:

> I'm not sure that "supporting our own locale subsystem" really qualifies
> as "sustainable" ... can you give an estimate of how big the code +
> supporting data is likely to be?

It's not much worse than supporting our own character conversion subsystem
(which, btw., is something we could more likely do without, because the
standard system facilities tend to be quite adequate), and certainly much
less of a burden than maintaining our own set of translated strings.

For the "ctype" category, you can generate the code straight out of the
Unicode tables, with a handful of hardcoded exceptions (like the Turkish
i). For the "collate" category we need about 40 kB of language-specific
data files plus a big master data file that is maintained by the Unicode
consortium. (Those 40 kB correspond to the 22 files I currently have,
which, together with the big default file, cover about 70 languages.)
The other locale categories aren't of interest for string processing.
The code isn't large, but of course someone needs to write it. The
algorithms are standardized (Unicode collation algorithm) and have several
existing implementations. So this isn't something that we would need to
maintain in a vacuum.
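
To illustrate the "ctype" point, here is a minimal sketch in Python (not proposed backend code). It assumes Python's built-in Unicode database as a stand-in for the Unicode tables, and the names build_upper_table, UPPER_EXCEPTIONS and upper are hypothetical. It generates a one-to-one uppercase table and layers a small hardcoded exception list on top of it.

```python
def build_upper_table(limit=0x250):
    """Generate a simple one-to-one uppercase table for code points < limit.

    Assumption: Python's own Unicode data stands in for the Unicode tables.
    """
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        up = ch.upper()
        # Keep only single-code-point mappings that actually change the
        # character; multi-character expansions (e.g. sharp s) are skipped.
        if len(up) == 1 and up != ch:
            table[ch] = up
    return table


# Hardcoded per-language exceptions applied before the generated table.
# Only the Turkish-style dotted i is listed here, purely for illustration.
UPPER_EXCEPTIONS = {
    "tr": {"i": "\u0130"},  # i -> capital I with dot above
    "az": {"i": "\u0130"},
}

DEFAULT_UPPER = build_upper_table()


def upper(text, lang=None):
    """Uppercase text using the generated table plus language exceptions."""
    exceptions = UPPER_EXCEPTIONS.get(lang, {})
    return "".join(exceptions.get(ch, DEFAULT_UPPER.get(ch, ch)) for ch in text)
```

With this sketch, upper("istanbul") uses the generated table only, while upper("istanbul", "tr") picks up the Turkish exception for the first letter. The exception list stays tiny because everything else falls through to the generated data.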

(Note that I say Unicode a lot here because those people do a lot of
research and standardization in this area, which is available for free,
but this does not constrain the result to work only with the Unicode
character set.)

> I agree that depending on the system-provided locale behavior has its
> downsides, but it has its upsides too; compatibility with the behavior
> of everything else on the machine being one big one. So the idea of
> being able to use glibc where available shouldn't be rejected out of
> hand, I think.

I like to think that in the end we can do much better than the POSIX
framework can do. For instance, the character classification can have
more useful categories, the case conversion can be context-dependent
(which is a requirement in some languages), and users could more directly
add their own collations or parametrize existing ones (because no one ever
seems to agree on the details).
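
As one example of context-dependent case conversion, Greek capital sigma lowercases to a different letter at the end of a word. The following is a simplified sketch in Python under my own assumptions: the function name lower_greek is hypothetical, and the rule shown is a reduced form of the Unicode final-sigma condition (it does not skip case-ignorable characters such as combining marks).

```python
CAPITAL_SIGMA = "\u03a3"
SMALL_SIGMA = "\u03c3"
FINAL_SIGMA = "\u03c2"


def lower_greek(text):
    """Lowercase text, choosing the sigma form from its surroundings."""
    out = []
    for i, ch in enumerate(text):
        if ch == CAPITAL_SIGMA:
            prev_is_letter = i > 0 and text[i - 1].isalpha()
            next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
            # Final form only when a letter precedes and no letter follows.
            if prev_is_letter and not next_is_letter:
                out.append(FINAL_SIGMA)
            else:
                out.append(SMALL_SIGMA)
        else:
            # All other characters use the context-free mapping.
            out.append(ch.lower())
    return "".join(out)
```

A per-character lookup table cannot express this, because the result for one code point depends on its neighbours, and the POSIX tolower() interface only ever sees a single character.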

--
Peter Eisentraut peter_e(at)gmx(dot)net
