Re: 7.4.1 release status - Turkish Locale

From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: <pgsql-hackers(at)postgreSQL(dot)org>
Cc: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <devrim(at)tdmsoft(dot)com>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-01 01:55:39
Message-ID: 000701c3e866$8584d890$1d00a8c0@ntufar
Lists: pgsql-hackers

> We might think that the Turkish-locale problem Devrim Gunduz pointed out
> is a must-fix, too. But I'm not convinced yet what to do about it.

Here is a first try to fix what Devrim Gunduz talked about.

Please be patient with me, for this is the first major patch
I have submitted, and I realize that I have blatantly violated
many rules of good style in PostgreSQL source code.

First, about the problem. The Turkish language has two letters "i".
One has a dot on top and the other does not. Simple as that.
The one with the dot keeps it in both upper and lower case, and
the one without the dot has no dot in either case...
as opposed to English, where "i" has a dot in lower case and
no dot in upper case.

The problem arises when PostgreSQL, running with the "tr_TR"
locale, converts an identifier such as a table, index, or
column name to lower case. If it is written with a capital "I",
tolower() with 'I' as its argument will return a Turkish-specific
character: 'i'-without-a-dot, which I am afraid will not be shown
correctly in your e-mail readers.

Let me give some examples.

The initdb script runs the apparently innocent script in the file
src/backend/utils/mb/conversion_procs/conversion_create.sql
to create a couple of functions whose only fault is
that they declare their return type as VOID. The backend
returns an error message that type "voıd" is not found, and
initdb fails.

An unsuspecting novice user was excited about the
SERIAL data type he was told is present in PostgreSQL.
It took Devrim and me a lot of time to explain why he
needs to type SERIAL as SERiAL for now, until a workaround
is developed.

Another case happened to me when I wanted to restore
a pg_dump dump. The restore failed because the script was
creating scripts that belong to PUBLIC.

As for the solution, after some research we found out that
the offender is the tolower() call in src/backend/parser/scan.l,
in the {identifier} section. tolower() works fine with any
locale and any character, except for the combination of the
Turkish locale and the capital 'I' character. So the obvious
solution is to add a check for the Turkish locale and the 'I'
character. Something like this:

    if (<locale is Turkish> && ident[i] == 'I')
        ident[i] = 'i';
    else
        ident[i] = tolower((unsigned char) ident[i]);
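
To make the idea concrete, here is a minimal, self-contained
sketch of the same check applied to a whole identifier. It is
not the code from the attached patch: the function name
fold_identifier and the turkish_locale flag are made up for
illustration, and the caller is assumed to know already whether
the locale is Turkish.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Fold an identifier to lower case in place.
 *
 * turkish_locale is a hypothetical flag supplied by the caller:
 * nonzero means the current locale is Turkish, so capital 'I'
 * is mapped to plain ASCII 'i' instead of going through tolower().
 */
void
fold_identifier(char *ident, int turkish_locale)
{
    size_t i;

    assert(ident != NULL);

    for (i = 0; ident[i] != '\0'; i++)
    {
        if (turkish_locale && ident[i] == 'I')
            ident[i] = 'i';
        else
            ident[i] = (char) tolower((unsigned char) ident[i]);
    }
}
```

With the flag set, "SERIAL" and "VOID" fold to "serial" and
"void" no matter what tolower() would return for 'I'.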

It looks rather simple, but the hard part was to figure out
what the current locale is. To do this I added

const char *get_locale_category(const char *category);

to src/backend/utils/adt/pg_locale.c. It returns the
locale identifier for the category specified, or for LC_ALL
if category is NULL. I could not find any other function
that returns what I need. Please help me find
one, because I would hate to introduce a new function.
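
For readers without the attachment, a function with this
signature could look roughly like the sketch below. This is a
guess at the behaviour described above, not the body from
trpatch.diff: it assumes the category is passed by name (such
as "LC_CTYPE") and relies only on the standard C setlocale()
call, which queries the current setting when its second
argument is NULL.

```c
#include <locale.h>
#include <string.h>

/*
 * Return the current locale identifier for the named category,
 * or for LC_ALL when category is NULL. Returns NULL for a name
 * this sketch does not recognize. Passing category names as
 * strings is an assumption made for illustration.
 */
const char *
get_locale_category(const char *category)
{
    int cat = LC_ALL;

    if (category != NULL)
    {
        if (strcmp(category, "LC_COLLATE") == 0)
            cat = LC_COLLATE;
        else if (strcmp(category, "LC_CTYPE") == 0)
            cat = LC_CTYPE;
        else if (strcmp(category, "LC_MONETARY") == 0)
            cat = LC_MONETARY;
        else if (strcmp(category, "LC_NUMERIC") == 0)
            cat = LC_NUMERIC;
        else if (strcmp(category, "LC_TIME") == 0)
            cat = LC_TIME;
        else if (strcmp(category, "LC_ALL") != 0)
            return NULL;        /* unknown category name */
    }

    /* A NULL locale argument queries the setting without changing it. */
    return setlocale(cat, NULL);
}
```

The returned string belongs to the C library and may be
overwritten by a later setlocale() call, so a caller that
needs to keep it should copy it.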

I realize that the {identifier} section is very
performance-critical, so I introduced a global variable

static int isturkishlocale = -1;

at the beginning of src/backend/parser/scan.l
It is set to -1 when not yet initialized, 0 if
locale is not Turkish and 1 if locale is Turkish.
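
The three-state flag can be filled in lazily on first use,
roughly as sketched below. The helper names
locale_name_is_turkish and is_turkish_locale are invented for
this illustration, and the test for a leading "tr" assumes
POSIX-style locale names such as "tr_TR"; other naming schemes
are not handled here.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* -1: not yet initialized, 0: locale is not Turkish, 1: locale is Turkish */
static int isturkishlocale = -1;

/*
 * Hypothetical helper: does a POSIX-style locale name such as
 * "tr", "tr_TR" or "tr_TR.ISO8859-9" denote Turkish?
 */
int
locale_name_is_turkish(const char *name)
{
    if (name == NULL || strncmp(name, "tr", 2) != 0)
        return 0;

    /* "tr" must be the whole language part, not a prefix of another code. */
    return name[2] == '\0' || name[2] == '_' ||
           name[2] == '.' || name[2] == '@';
}

/*
 * Return 1 if LC_CTYPE is Turkish, 0 otherwise. The locale is
 * queried only on the first call; later calls reuse the cached
 * answer, so the hot path is a single integer comparison.
 */
int
is_turkish_locale(void)
{
    if (isturkishlocale < 0)
        isturkishlocale = locale_name_is_turkish(setlocale(LC_CTYPE, NULL));

    assert(isturkishlocale == 0 || isturkishlocale == 1);
    return isturkishlocale;
}
```

The cached value goes stale if the locale is changed after the
first call, so this only works if the locale stays fixed for
the life of the process.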

This might not be the way it is usually done in PostgreSQL
source code. Could you please advise whether the name I chose
is appropriate and whether there is a more appropriate
place to put the declaration and initialization?

Best regards,
Nicolai Tufar & Devrim Gunduz

Attachment Content-Type Size
trpatch.diff application/octet-stream 2.7 KB
