Re: Bug in UTF8-Validation Code?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 12:25:56
Message-ID: 45FD2FD4.8070406@dunslane.net
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
>
>> How can we fix this? Frankly, the statement in the docs warning about
>> making sure that escaped sequences are valid in the server encoding is a
>> cop-out. We don't accept invalid data elsewhere, and this should be no
>> different IMNSHO. I don't see why this should be any different from,
>> say, date or numeric data. For years people have sneered at MySQL
>> because it accepted dates like Feb 31st, and rightly so. But this seems
>> to me to be like our own version of the same problem.
>>
>
> It seems to me that the easiest solution would be to forbid \x?? escape
> sequences where it's greater than \x7F for UTF-8 server encodings.
> Instead introduce a \u escape for specifying the unicode character
> directly. Under the basic principle that any escape sequence still has
> to represent a single character. The result can be multiple bytes, but
> you don't have to check for consistency anymore.
>
> Have a nice day,
>

The escape processing is actually done in the lexer in the case of
literals. We have to allow for bytea literals there too, regardless of
encoding. The lexer naturally has no notion of the intended destination
of the literal, so we need to defer the validity check to the *in
functions for encoding-aware types. And, as Tom has noted, COPY does
its own escape processing but does it before the transcoding.
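
To illustrate the deferral, here is a minimal sketch in Python rather than the backend's C. The names lex_unescape, text_in and bytea_in are hypothetical stand-ins, not PostgreSQL functions, and only \xHH escapes are handled. The lexer stage expands escapes without knowing the destination type, and only the encoding-aware input function rejects invalid bytes:

```python
import re

# Matches a backslash, 'x', and one or two hex digits (illustration only;
# the real lexer handles more escape forms than this).
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{1,2})")


def lex_unescape(raw: bytes) -> bytes:
    """Lexer stage: expand \\xHH escapes into raw bytes, with no encoding check."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def text_in(data: bytes, server_encoding: str = "utf-8") -> str:
    """Hypothetical input function for an encoding-aware type.

    This is where the deferred validity check happens.
    """
    try:
        return data.decode(server_encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"invalid byte sequence for encoding {server_encoding}"
        ) from exc


def bytea_in(data: bytes) -> bytes:
    """Hypothetical bytea input function: any byte sequence is acceptable."""
    return data
```

With this split, the same unescaped bytes can be accepted as bytea and rejected as text, which is the behaviour the lexer alone cannot provide.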

So ISTM that any solution other than something like what I have proposed
will probably involve substantial surgery.

It does also seem from my test results that transcoding to MB charsets
(or at least to utf-8) is surprisingly expensive, and that this would be
a good place to look at optimisation possibilities. The validity tests
can also be somewhat expensive.

But correctness matters most, IMNSHO.

cheers

andrew
