Re: UTF-8 encoding problem w/ libpq

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Martin Schäfer <Martin(dot)Schaefer(at)cadcorp(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UTF-8 encoding problem w/ libpq
Date: 2013-06-03 17:44:21
Message-ID: 51ACD5F5.3030407@dunslane.net
Lists: pgsql-hackers


On 06/03/2013 12:22 PM, Heikki Linnakangas wrote:
> On 03.06.2013 18:27, ktm(at)rice(dot)edu wrote:
>> On Mon, Jun 03, 2013 at 04:09:29PM +0100, Martin Schäfer wrote:
>>>
>>>>> If I change the strCreate query and add double quotes around the
>>>>> column name, then the problem disappears. But the original name is
>>>>> already in lowercase, so I think it should also work without quoting
>>>>> the column name.
>>>>> Am I missing some setup in either the database or in the use of
>>>>> libpq?
>>>>>
>>>>> I’m using PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
>>>>>
>>>>> The database uses:
>>>>> ENCODING = 'UTF8'
>>>>> LC_COLLATE = 'English_United Kingdom.1252'
>>>>> LC_CTYPE = 'English_United Kingdom.1252'
>>>>>
>>>>> Thanks for any help,
>>>>>
>>>>> Martin
>>>>>
>>>>
>>>> Hi Martin,
>>>>
>>>> If you do not want the lowercase behavior, you must put double-quotes
>>>> around the column name per the documentation:
>>>>
>>>> http://www.postgresql.org/docs/9.2/interactive/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
>>>>
>>>> section 4.1.1.
>>>>
>>>> Regards,
>>>> Ken
>>>
>>> The original name 'id_äß' is already in lowercase. The backend
>>> should leave it unchanged IMO.
>>
>> Only in utf-8 which needs to be double-quoted for a column name as
>> you have seen, otherwise the value will be lowercased per byte.
>
> He *is* using UTF-8. Or trying to, anyway :-). The downcasing in the
> backend is supposed to leave bytes with the high-bit set alone, ie. in
> UTF-8 encoding, it's supposed to leave ä and ß alone.
>
> I suspect that the conversion to UTF-8, before the string is sent to
> the server, is not being done correctly. I'm not sure what's wrong
> there, but I'd suggest printing the actual byte sequence sent to the
> server, to check if it's in fact valid UTF-8. ie. replace the PQexec()
> line with something like:
>
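> /* Keep the converted string alive; calling .c_str() on the temporary
>    would leave s dangling. Assumes ToUtf8() returns a std::string. */
> std::string utf8 = ToUtf8(strCreate.c_str());
> const char *s = utf8.c_str();
> int i;
> for (i = 0; s[i]; i++)
>     printf("%02x", (unsigned char) s[i]);
> printf("\n");
> pResult = PQexec(pConn, s);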
>
> That should contain the UTF-8 byte sequence for äß, "c3a4c39f"
>
>

Umm, no, the backend code doesn't do it right. Some time ago I suggested
a fix for this - see
<http://www.postgresql.org/message-id/50ACF7FA.7070108@dunslane.net>.
Tom thought there might be other places that need fixing, and I haven't
had time to look for them. But maybe we should just fix this one for now
at least.
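
To make the failure mode concrete, here is a minimal sketch of how
per-byte downcasing can damage a UTF-8 identifier. It is not the backend
code: all function names are hypothetical, and fold_single_byte_locale()
is a simplified stand-in (an assumption) for what a per-byte tolower()
might do under a Latin-1-style single-byte locale.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fold only ASCII A-Z. Bytes with the high bit set pass through
 * untouched, so multi-byte UTF-8 sequences survive. */
void fold_ascii_only(unsigned char *s)
{
    for (; *s; s++)
        if (*s >= 'A' && *s <= 'Z')
            *s += 'a' - 'A';
}

/* Hypothetical, simplified model of per-byte lowercasing in a
 * Latin-1-style locale: bytes 0xC0-0xDE (except 0xD7) are treated as
 * uppercase letters and mapped to byte + 0x20. */
void fold_single_byte_locale(unsigned char *s)
{
    for (; *s; s++)
    {
        if (*s >= 'A' && *s <= 'Z')
            *s += 'a' - 'A';
        else if (*s >= 0xC0 && *s <= 0xDE && *s != 0xD7)
            *s += 0x20;
    }
}

/* Print a NUL-terminated byte string as lowercase hex. */
void print_hex(const unsigned char *s)
{
    for (; *s; s++)
        printf("%02x", (unsigned) *s);
    printf("\n");
}

/* Run both folds on the UTF-8 bytes of the identifier from this thread. */
void demo(void)
{
    unsigned char good[] = "id_\xc3\xa4\xc3\x9f";
    unsigned char bad[]  = "id_\xc3\xa4\xc3\x9f";

    fold_ascii_only(good);
    fold_single_byte_locale(bad);

    /* The ASCII-only fold leaves the already-lowercase name unchanged. */
    assert(memcmp(good, "id_\xc3\xa4\xc3\x9f", 8) == 0);

    print_hex(good);   /* 69645fc3a4c39f */
    print_hex(bad);    /* 69645fe3a4e39f */
}
```

In this model the lead byte 0xc3 of both ä and ß is treated as an
uppercase letter and becomes 0xe3, so the result is no longer valid
UTF-8 even though the name was already lowercase.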

cheers

andrew
