Re: invalidly encoded strings

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: invalidly encoded strings
Date: 2007-09-09 14:51:39
Message-ID: 18120.1189349499@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I have been looking at fixing the issue of accepting strings that are
> not valid in the database encoding. It appears from previous discussion
> that we need to add a call to pg_verifymbstr() to the relevant input
> routines and ensure that the chr() function returns a valid string. That
> leaves several issues:

> . which are the relevant input routines? I have identified the following
> as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(),
> namein(). Do we also need one for cstring_in()? Does the xml code
> handle this as part of xml validation?

This seems entirely the wrong approach, because 99% of the time a
check placed here will be redundant with the one in the main
client-input logic. (That was, indeed, the reason I took such checks
out of these places in the first place.) Moreover, as you've already
found out, there are N places that would have to be fixed, and we'd face
a constant hazard of new datatypes introducing new holes.

AFAICS the risk comes only from chr() and the possibility of numeric
backslash-escapes in SQL string literals, and we ought to think about
fixing it in those places.

A possible answer is to add a verifymbstr to the string literal
converter anytime it has processed a numeric backslash-escape in the
string. Open questions for that are (1) does it have negative effects
for bytea, and if so is there any hope of working around it? (2) how
can we postpone the conversion/test to the parse analysis phase?
(To the extent that database encoding is frozen it'd probably be OK
to do it in the scanner, but such a choice will come back to bite
us eventually.)
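
To make the proposal concrete, here is a minimal sketch in C of a literal
converter that runs a verification pass only when it has actually decoded a
numeric (octal) backslash-escape. Everything in it is an assumption made for
illustration: the function names are hypothetical, the encoding is taken to be
UTF-8, and sketch_utf8_is_valid() is a simplified stand-in for
pg_verifymbstr(), not the backend's code. It does not address the bytea
question or where in the scanner/parse-analysis pipeline the test should run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pg_verifymbstr(), assuming a UTF-8 database. */
static bool
sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t n;               /* continuation bytes expected after c */

        if (c < 0x80) n = 0;
        else if (c >= 0xC2 && c <= 0xDF) n = 1;
        else if (c >= 0xE0 && c <= 0xEF) n = 2;
        else if (c >= 0xF0 && c <= 0xF4) n = 3;
        else return false;      /* stray continuation or invalid lead byte */

        if (len - i - 1 < n)
            return false;       /* sequence truncated */
        for (size_t k = 1; k <= n; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;

        /* reject overlong forms, surrogates, and values above U+10FFFF */
        if ((c == 0xE0 && s[i + 1] < 0xA0) || (c == 0xED && s[i + 1] > 0x9F) ||
            (c == 0xF0 && s[i + 1] < 0x90) || (c == 0xF4 && s[i + 1] > 0x8F))
            return false;
        i += n + 1;
    }
    return true;
}

/*
 * Decode the body of a string literal from src into dst.  \ooo is an octal
 * escape; a backslash before any other character yields that character.
 * dst must have room for strlen(src) bytes.  Returns the decoded length,
 * or -1 if a numeric escape was processed and the result is not valid.
 */
static long
sketch_unescape_literal(const char *src, unsigned char *dst)
{
    size_t out = 0;
    bool saw_numeric_escape = false;

    assert(src != NULL && dst != NULL);
    while (*src)
    {
        if (*src == '\\' && src[1] >= '0' && src[1] <= '7')
        {
            unsigned int val = 0;
            int digits = 0;

            src++;              /* skip the backslash */
            while (digits < 3 && *src >= '0' && *src <= '7')
            {
                val = val * 8 + (unsigned int) (*src - '0');
                src++;
                digits++;
            }
            dst[out++] = (unsigned char) (val & 0xFF);
            saw_numeric_escape = true;
        }
        else if (*src == '\\' && src[1] != '\0')
        {
            dst[out++] = (unsigned char) src[1];
            src += 2;
        }
        else
            dst[out++] = (unsigned char) *src++;
    }

    /* Verify only when an escape could have injected arbitrary bytes. */
    if (saw_numeric_escape && !sketch_utf8_is_valid(dst, out))
        return -1;
    return (long) out;
}
```

The point of the saw_numeric_escape flag is that a literal without numeric
escapes skips the second pass entirely, on the premise that the main
client-input check has already covered it.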

> . for chr() under UTF8, it seems to be generally agreed that the
> argument should represent the codepoint and the function should return
> the correspondingly encoded character. If so, possibly the argument
> should be a bigint to accommodate the full range of possible code
> points. It is not clear what the argument should represent for other
> multi-byte encodings for any argument higher than 127. Similarly, it is
> not clear what ascii() should return in such cases. I would be inclined
> just to error out.

In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings,
OK with throwing error.
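
The chr() rules discussed above can be sketched as follows. This is only an
illustration under stated assumptions: the enum, the function name, and the
-1 return standing in for "throw an error" are all hypothetical, and the
UTF-8 branch additionally rejects surrogates and values above U+10FFFF,
which the thread does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding classes, just for this sketch. */
typedef enum
{
    SKETCH_SQL_ASCII,
    SKETCH_UTF8,
    SKETCH_OTHER_MB
} sketch_encoding;

/*
 * Write the bytes for chr(arg) into out and return how many were written
 * (1..4), or -1 where the real function would raise an error.
 */
static int
sketch_chr(sketch_encoding enc, unsigned long arg, unsigned char out[4])
{
    assert(out != NULL);

    switch (enc)
    {
        case SKETCH_SQL_ASCII:
            /* no encoding knowledge: accept any single byte, 0..255 */
            if (arg > 255)
                return -1;
            out[0] = (unsigned char) arg;
            return 1;

        case SKETCH_UTF8:
            /* arg is a code point; emit its UTF-8 encoding */
            if (arg > 0x10FFFF || (arg >= 0xD800 && arg <= 0xDFFF))
                return -1;
            if (arg < 0x80)
            {
                out[0] = (unsigned char) arg;
                return 1;
            }
            if (arg < 0x800)
            {
                out[0] = (unsigned char) (0xC0 | (arg >> 6));
                out[1] = (unsigned char) (0x80 | (arg & 0x3F));
                return 2;
            }
            if (arg < 0x10000)
            {
                out[0] = (unsigned char) (0xE0 | (arg >> 12));
                out[1] = (unsigned char) (0x80 | ((arg >> 6) & 0x3F));
                out[2] = (unsigned char) (0x80 | (arg & 0x3F));
                return 3;
            }
            out[0] = (unsigned char) (0xF0 | (arg >> 18));
            out[1] = (unsigned char) (0x80 | ((arg >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((arg >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (arg & 0x3F));
            return 4;

        case SKETCH_OTHER_MB:
            /* other multibyte encodings: only the ASCII range is unambiguous */
            if (arg > 127)
                return -1;
            out[0] = (unsigned char) arg;
            return 1;
    }
    return -1;
}
```

For example, under these assumptions sketch_chr(SKETCH_UTF8, 0xE9, buf)
writes the two bytes 0xC3 0xA9, while sketch_chr(SKETCH_OTHER_MB, 200, buf)
reports an error.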

regards, tom lane
