Re: Bug in UTF8-Validation Code?

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-01 10:30:51
Message-ID: 20070401103051.GB15919@svana.org
Lists: pgsql-hackers

On Sat, Mar 31, 2007 at 07:47:21PM -0700, Mark Dilger wrote:
> OK, I can take a stab at fixing this. I'd like to state some assumptions
> so people can comment and reply:
>
> I assume that I need to fix *all* cases where invalid byte encodings get
> into the database through functions shipped in the core distribution.

Yes.

> I assume I do not need to worry about people getting bad data into the
> system through their own database extensions.

That'd be rather difficult :)

> I assume that the COPY problem discussed up-thread goes away once you
> eliminate all the paths by which bad data can get into the system.
> However, existing database installations with bad data already loaded will
> not be magically fixed with these code patches.

Correct.

> Do any of the string functions (see
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run
> the risk of generating invalid utf8 encoded strings? Do I need to add
> checks? Are there known bugs with these functions in this regard?

I don't think so. They'd be bugs if they were...

> If not, I assume I can add mbverify calls to the various input routines
> (textin, varcharin, etc) where invalid utf8 could otherwise enter the
> system.

The only hard part is handling the places where escaping and unescaping
happen...
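
A minimal sketch of why the position of the check relative to unescaping
matters, assuming backslash-octal escapes of the form \ooo. The helpers
all_ascii() and unescape_octal() are hypothetical and written only for this
illustration; they are not PostgreSQL functions. The escaped text is pure
7-bit ASCII, so it would pass any encoding check, yet the decoded bytes
contain 0xC3 followed by 0x28, which is not well-formed UTF-8. A check that
runs before unescaping would therefore miss it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte of s is 7-bit ASCII. */
bool
all_ascii(const char *s)
{
    for (; *s; s++)
        if ((unsigned char) *s >= 0x80)
            return false;
    return true;
}

static bool
is_octal(char c)
{
    return c >= '0' && c <= '7';
}

/*
 * Hypothetical helper: decode "\ooo" (three octal digits, value <= 0377)
 * and "\\"; copy every other byte unchanged.  Returns the number of bytes
 * written to out, or (size_t) -1 if out is too small.
 */
size_t
unescape_octal(const char *in, unsigned char *out, size_t cap)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*in)
    {
        unsigned char b;

        if (in[0] == '\\' && in[1] >= '0' && in[1] <= '3' &&
            is_octal(in[2]) && is_octal(in[3]))
        {
            b = (unsigned char) (((in[1] - '0') << 6) |
                                 ((in[2] - '0') << 3) |
                                 (in[3] - '0'));
            in += 4;
        }
        else if (in[0] == '\\' && in[1] == '\\')
        {
            b = '\\';
            in += 2;
        }
        else
            b = (unsigned char) *in++;

        if (n >= cap)
            return (size_t) -1;
        out[n++] = b;
    }
    return n;
}
```

Under these assumptions, any verification has to see the bytes that
unescape_octal() produces, not the text that was passed in.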

> I assume that this work can be limited to HEAD and that I don't need to
> back-patch it. (I suspect this assumption is a contentious one.)

At the very least I'd start with HEAD. Whether it gets backpatched
probably depends on how invasive it ends up being...

There's also the performance angle. The current mbverify is very
inefficient for encodings like UTF-8. You might need to refactor a bit
there...
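
For illustration, a single-pass UTF-8 check could look roughly like the
sketch below. This is not the existing mbverify code and not a proposed
patch; utf8_is_valid() is a hypothetical name, and the byte ranges follow
the well-formedness rules of RFC 3629 (no overlong forms, no surrogates,
nothing above U+10FFFF). The point is only that each byte is examined once,
with no per-character function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a PostgreSQL function.
 * Returns true if the len bytes at s are well-formed UTF-8.
 */
bool
utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len)
    {
        unsigned char c = s[i];
        size_t        need;        /* continuation bytes expected */
        unsigned char lo = 0x80;   /* allowed range of the first */
        unsigned char hi = 0xBF;   /* continuation byte */

        if (c < 0x80)
        {
            i++;                   /* plain ASCII */
            continue;
        }
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;
        else if (c >= 0xE0 && c <= 0xEF)
        {
            need = 2;
            if (c == 0xE0)
                lo = 0xA0;         /* reject overlong 3-byte forms */
            if (c == 0xED)
                hi = 0x9F;         /* reject UTF-16 surrogates */
        }
        else if (c >= 0xF0 && c <= 0xF4)
        {
            need = 3;
            if (c == 0xF0)
                lo = 0x90;         /* reject overlong 4-byte forms */
            if (c == 0xF4)
                hi = 0x8F;         /* reject values above U+10FFFF */
        }
        else
            return false;          /* 0x80..0xC1 and 0xF5..0xFF never lead */

        if (len - i - 1 < need)
            return false;          /* sequence truncated at end of input */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;

        i += need + 1;
    }
    return true;
}
```

A caller would pass the raw bytes and their length, for example
utf8_is_valid((const unsigned char *) str, strlen(str)).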

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
