Re: tolower() identifier downcasing versus multibyte encodings

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org, "Francisco Figueiredo Jr(dot)" <francisco(at)npgsql(dot)org>
Subject:	Re: tolower() identifier downcasing versus multibyte encodings
Date:	2011-09-06 02:18:24
Message-ID:	201109060218.p862IOZ23903@momjian.us
Lists:	pgsql-hackers

Did we ever address this?

---------------------------------------------------------------------------

Tom Lane wrote:
> I've been able to reproduce the behavior described here:
> http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
> It's specific to UTF8 locales on Mac OS X. I'm not sure if the
> problem can manifest anywhere else; considering that OS X's UTF8
> locales have a general reputation of being broken, it may only
> happen on that platform.
>
> What is happening is that downcase_truncate_identifier() tries to
> downcase identifiers like this:
>
>     unsigned char ch = (unsigned char) ident[i];
>
>     if (ch >= 'A' && ch <= 'Z')
>         ch += 'a' - 'A';
>     else if (IS_HIGHBIT_SET(ch) && isupper(ch))
>         ch = tolower(ch);
>     result[i] = (char) ch;
>
> This is of course incapable of successfully downcasing any multibyte
> characters, but there's an assumption that isupper() won't return TRUE
> for a character fragment in a multibyte locale. However, on OS X
> it seems that that's not the case :-(. For the particular example
> cited by Francisco Figueiredo, I see the byte sequence \303\251
> converted to \343\251, because isupper() returns TRUE for \303 and
> then tolower() returns \343. The byte \251 is not changed, but the
> damage is already done: we now have an invalidly-encoded string.
>
> It looks like the blame for the subsequent "disappearance" of the bogus
> data lies with fprintf back on the client side; that surprises me a bit
> because I'd only heard of glibc being so cavalier with data it thought
> was invalidly encoded. But anyway, the origin of the problem is in the
> downcasing transformation.
>
> We could possibly fix this by not attempting the downcasing
> transformation on high-bit-set characters unless the encoding is
> single-byte. However, we have the exact same downcasing logic embedded
> in the functions in src/port/pgstrcasecmp.c, and those don't have any
> convenient way of knowing what the prevailing encoding is --- when
> compiled for frontend use, they can't use pg_database_encoding_max_length.
>
> Or we could bite the bullet and start using str_tolower(), but the
> performance implications of that are unpleasant; not to mention that
> we really don't want to re-introduce the "Turkish problem" with
> unexpected handling of i/I in identifiers.
>
> Or we could go the other way and stop downcasing non-ASCII letters
> altogether.
>
> None of these options seem terribly attractive. Thoughts?
>
> regards, tom lane
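
For anyone picking this up again, the corruption described above can be reproduced without an OS X locale. What follows is only a minimal sketch, not the code in the tree: bytewise_isupper() and bytewise_tolower() are hypothetical stand-ins that classify each byte as if it were Latin-1 (an assumption, chosen to mimic a locale whose isupper() returns true for a UTF8 lead byte), IS_HIGHBIT_SET is redefined locally, and downcase_bytes() with its allow_highbit flag is an invented name for illustrating the first option, skipping high-bit-set bytes unless the encoding is single-byte.

```c
#include <assert.h>
#include <stddef.h>

/* Local redefinition for the sketch; true for any byte >= 0x80. */
#define IS_HIGHBIT_SET(ch) ((unsigned char) (ch) & 0x80)

/*
 * Hypothetical stand-ins for isupper()/tolower() in a locale that
 * classifies each byte as if it were Latin-1. This is an assumption made
 * so the example behaves the same on every platform.
 */
static int
bytewise_isupper(unsigned char ch)
{
	/* Latin-1 uppercase letters: 0xC0..0xDE, except 0xD7 (multiplication sign) */
	return ch >= 0xC0 && ch <= 0xDE && ch != 0xD7;
}

static unsigned char
bytewise_tolower(unsigned char ch)
{
	return bytewise_isupper(ch) ? (unsigned char) (ch + 0x20) : ch;
}

/*
 * Downcase len bytes of ident into result (which needs len + 1 bytes).
 * allow_highbit = 1 mimics the loop quoted above; allow_highbit = 0 is
 * the guard a caller would apply for a multibyte encoding.
 */
static void
downcase_bytes(const char *ident, char *result, size_t len, int allow_highbit)
{
	size_t		i;

	assert(ident != NULL && result != NULL);
	for (i = 0; i < len; i++)
	{
		unsigned char ch = (unsigned char) ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		else if (allow_highbit && IS_HIGHBIT_SET(ch) && bytewise_isupper(ch))
			ch = bytewise_tolower(ch);
		result[i] = (char) ch;
	}
	result[len] = '\0';
}
```

Calling downcase_bytes("\303\251", out, 2, 1) turns the two-byte UTF8 sequence into \343\251, the invalid string from the report, while passing 0 for allow_highbit leaves it untouched and still downcases plain ASCII. The flag is exactly the piece of information the src/port/pgstrcasecmp.c functions have no convenient way to obtain in frontend code, so this sketch does not address that part of the problem.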

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

In response to

tolower() identifier downcasing versus multibyte encodings at 2011-03-19 04:10:58 from Tom Lane
