Re: Bug in UTF8-Validation Code?

From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: andrew(at)supernews(dot)com
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 20:50:36
Message-ID: 46116C9C.80300@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Mark Dilger wrote:
> Andrew - Supernews wrote:
>> On 2007-04-01, Mark Dilger <pgsql(at)markdilger(dot)com> wrote:
>>> Do any of the string functions (see
>>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html)
>>> run the risk of generating invalid utf8 encoded strings? Do I need
>>> to add checks?
>>> Are there known bugs with these functions in this regard?
>>
>> The chr() function returns an octet, rather than a character; this is
>> clearly
>> wrong and needs fixing.
>>
>
> Ok, I've altered the chr() function. I am including a transcript from
> psql below. There are several design concerns:
>
> 1) In the current implementation, chr(0) returns a 5-byte text object
> (4-bytes of overhead plus one byte of data) containing a null. In the
> new implementation, this returns an error. I don't know, but it is
> possible that people currently use things like "SELECT chr(0) || chr(0)
> || ..." to build up strings of nulls.
>
> 2) Under utf8, chr(X) fails for X = 128..255. This may also break
> current users expectations.
>
> 3) The implicit modulus operation that was being performed by chr() is
> now gone, which might break some users.
>
> 4) You can't represent the high end of the astral plain with type
> INTEGER, unless you pass in a negative value, which is somewhat
> unintuitive. Since chr() expects an integer (and not a bigint) the user
> needs handle the sign bit correctly.
>
> mark
>
> ---------------------
>
>
>
>
> Welcome to psql 8.3devel, the PostgreSQL interactive terminal.
>
> Type: \copyright for distribution terms
> \h for help with SQL commands
> \? for help with psql commands
> \g or terminate with semicolon to execute query
> \q to quit
>
> pgsql=# select chr(0);
> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(65);
> chr
> -----
> A
> (1 row)
>
> pgsql=# select chr(128);
> ERROR: character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(53398);
> chr
> -----
> Ж
> (1 row)
>
> pgsql=# select chr(14989485);
> chr
> -----
> 中
> (1 row)
>
> pgsql=# select chr(4036005254);
> ERROR: function chr(bigint) does not exist
> LINE 1: select chr(4036005254);
> ^
> HINT: No function matches the given name and argument types. You might
> need to add explicit type casts.

Here's the code for the new chr() function:

Datum
chr(PG_FUNCTION_ARGS)
{
int32 cvalue = PG_GETARG_INT32(0);
text *result;

if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
{
int encoding,
len,
byteoff;
uint32 buf[2];
const char *bufptr;

encoding = GetDatabaseEncoding();
buf[0] = htonl(cvalue);
buf[1] = 0;
bufptr = (const char *)&buf;
for (byteoff = 0; byteoff < sizeof(uint32) && 0 == *bufptr; ++byteoff,
++bufptr);
len = pg_encoding_mblen(encoding,bufptr);
if (byteoff + len != sizeof(uint32) || !pg_verify_mbstr(encoding,
bufptr, len, true /* noError */))
report_untranslatable_char(PG_SQL_ASCII, encoding, bufptr,
sizeof(int32));
result = (text *) palloc(VARHDRSZ + len);
SET_VARSIZE(result, VARHDRSZ + len);
memcpy(VARDATA(result),bufptr,len);
}
else
{
result = (text *) palloc(VARHDRSZ + 1);
SET_VARSIZE(result, VARHDRSZ + 1);
*VARDATA(result) = (char) cvalue;
}

PG_RETURN_TEXT_P(result);
}

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2007-04-02 20:50:49 Re: CheckpointStartLock starvation
Previous Message Gregory Stark 2007-04-02 20:43:00 Re: Is this portable?