Re: multibyte-character aware support for function "downcase_truncate_identifier()"

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 23:09:14
Message-ID: AANLkTikweY9M4vfR0KmKwZiit-w8siSgsSk3x6iuj8Rz@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 21, 2010 at 4:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> IIRC this is intentional.  Please consult the archives for previous
>>> discussions.
>
>> Why would this be intentional?
>
> Well, it's intentional for lack of any infrastructure that would allow
> a more spec-compliant approach.  As you say, calling str_tolower here
> is probably a non-starter for performance reasons.  Another big problem
> is that str_tolower produces a locale-specific downcasing conversion.
> This (a) is going to create portability headaches of the first magnitude,
> and (b) is not really an advance in terms of spec compliance.  The SQL
> spec says that identifier case folding should be done according to the
> Unicode standard, but it's not safe to assume that any random
> platform-specific locale is going to act that way.  A specific example
> of a locale that is known to NOT behave acceptably is Turkish: they have
> weird ideas about i versus I, which in fact broke things back when we
> used to use tolower for this purpose.  See the archives from early 2004,
> and in particular commit 59f9a0b9df0d224bb62ff8ec5b65e0b187655742, which
> removed the exact same logic (though not wide-character-aware) that this
> patch proposes to put back.
>
> I think the given patch can be rejected out of hand.  If the OP has any
> ideas about doing non-locale-dependent case folding at an acceptable
> speed, I'm happy to listen.

I think that's fair. It actually doesn't seem like it should be that
hard if we knew that the server encoding were UTF8 - it's just a big
translation table somewhere, no? We could use heuristics to copy as many
characters as possible without detailed examination and consult the
lookup table for the rest. However, that's not very practical when
more than one encoding must be handled. What sort of
infrastructure would actually be useful for dealing with this problem?
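
A minimal sketch of the UTF8-only idea described above, for illustration. It assumes the input is already-validated UTF-8, copies ASCII bytes through a fast path, and consults a lookup table only for multibyte characters. The function names (fold_identifier_utf8, lookup_lower) and the six-entry table are hypothetical; a real table would be generated from the Unicode character database, and none of this is existing PostgreSQL code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical excerpt of a simple-lowercase table, sorted by 'upper'. */
typedef struct { uint32_t upper; uint32_t lower; } CaseMapEntry;

static const CaseMapEntry case_map[] = {
    {0x00C4, 0x00E4},   /* LATIN CAPITAL A WITH DIAERESIS */
    {0x00D6, 0x00F6},   /* LATIN CAPITAL O WITH DIAERESIS */
    {0x00DC, 0x00FC},   /* LATIN CAPITAL U WITH DIAERESIS */
    {0x0130, 0x0069},   /* LATIN CAPITAL I WITH DOT ABOVE -> plain i */
    {0x0394, 0x03B4},   /* GREEK CAPITAL DELTA */
    {0x0416, 0x0436},   /* CYRILLIC CAPITAL ZHE */
};

/* Binary search; code points without an entry map to themselves. */
static uint32_t lookup_lower(uint32_t cp)
{
    size_t lo = 0, hi = sizeof(case_map) / sizeof(case_map[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (case_map[mid].upper == cp)
            return case_map[mid].lower;
        if (case_map[mid].upper < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return cp;
}

/* Decode one multibyte sequence; returns its length, or 0 if malformed.
 * Assumption: overlong forms were rejected earlier by encoding validation. */
static int utf8_decode(const unsigned char *s, size_t avail, uint32_t *cp)
{
    int len, i;

    if ((s[0] & 0xE0) == 0xC0)      { len = 2; *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; }
    else return 0;
    if ((size_t) len > avail)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Encode one code point into out[]; returns the number of bytes written. */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char) cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Fold src into dst without consulting the locale.  Returns the number of
 * bytes written (excluding the NUL), or -1 on malformed input or overflow. */
static int fold_identifier_utf8(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t n = strlen(src), i = 0, o = 0;

    if (dstsize == 0)
        return -1;
    while (i < n) {
        unsigned char buf[4];
        uint32_t cp;
        int inlen, outlen;

        if (s[i] < 0x80) {          /* fast path: no table lookup for ASCII */
            unsigned char ch = s[i++];

            if (ch >= 'A' && ch <= 'Z')
                ch += 'a' - 'A';
            if (o + 1 >= dstsize)
                return -1;
            dst[o++] = (char) ch;
            continue;
        }
        inlen = utf8_decode(s + i, n - i, &cp);
        if (inlen == 0)
            return -1;
        outlen = utf8_encode(lookup_lower(cp), buf);
        if (o + (size_t) outlen >= dstsize)
            return -1;
        memcpy(dst + o, buf, (size_t) outlen);
        o += (size_t) outlen;
        i += (size_t) inlen;
    }
    dst[o] = '\0';
    return (int) o;
}
```

Note that a folded character can have a different byte length than the original (the U+0130 entry shrinks from two bytes to one), so the output cannot simply be written byte-for-byte over the input. The sketch also does nothing for the multiple-encoding case raised above; other encodings would need either their own tables or a conversion step through Unicode.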

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
