Re: Unicode string literals versus the world

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode string literals versus the world
Date: 2009-04-14 15:54:33
Message-ID: 12063.1239724473@sss.pgh.pa.us
Lists: pgsql-hackers

Marko Kreen <markokr(at)gmail(dot)com> writes:
> I would prefer that such quoting extensions would wait until
> stdstr=on setting is the only mode Postgres will operate.
> Fitting new quoting ways to environment with flippable stdstr setting
> will be rather painful for everyone.

It would certainly be a lot safer to wait until non-standard-conforming
strings don't exist anymore. The problem is that that may never happen,
and is certainly not on the roadmap to happen in the foreseeable future.

> I still stand on my proposal, how about extending E'' strings with
> unicode escapes (eg. \uXXXX)? The E'' strings are already more
> clearly defined than '' and they are our "own", we don't need to
> consider random standards, but can consider our sanity.

That's one way we could proceed. The other proposal that seemed
attractive to me was a decode-like function:

uescape('foo\00e9bar')
uescape('foo\00e9bar', '\')

(double all the backslashes if you assume standard_conforming_strings
is off). The arguments in favor of this one
are (1) you can apply it to the result of an expression, it's not
strictly tied to literals; and (2) it's a lot lower-footprint solution
since it doesn't affect basic literal handling. If you wish to suppose
that this is only a stopgap until someday when we can implement the SQL
standard syntax more safely, then low footprint is good. One could
even imagine back-porting this into existing releases as a user-defined
function.
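
To make the intended behavior of such a function concrete, here is a minimal sketch of the decoding step, written in Python rather than as an actual PostgreSQL function. The name uescape and the optional escape-character argument come from the examples above; everything else is an assumption borrowed from the SQL standard's U&'' syntax: four hex digits after the escape character, a "+" followed by six hex digits for larger code points, and a doubled escape character standing for itself.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def uescape(text: str, escape: str = "\\") -> str:
    """Decode Unicode escapes in text (hypothetical semantics, see lead-in).

    <escape>XXXX     -> code point given by 4 hex digits
    <escape>+XXXXXX  -> code point given by 6 hex digits
    <escape><escape> -> one literal escape character
    """
    if len(escape) != 1 or escape in HEX_DIGITS or escape == "+":
        raise ValueError("escape must be one character, not a hex digit or '+'")

    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != escape:
            # Ordinary character: copy it through unchanged.
            out.append(ch)
            i += 1
            continue
        if text[i + 1:i + 2] == escape:
            # Doubled escape character stands for itself.
            out.append(escape)
            i += 2
            continue
        if text[i + 1:i + 2] == "+":
            digits, consumed = text[i + 2:i + 8], 8
        else:
            digits, consumed = text[i + 1:i + 5], 5
        expected = consumed - (2 if consumed == 8 else 1)
        if len(digits) != expected or any(c not in HEX_DIGITS for c in digits):
            raise ValueError(f"invalid Unicode escape at position {i}")
        out.append(chr(int(digits, 16)))
        i += consumed
    return "".join(out)


# The two calls from the message, written with Python raw strings.
print(uescape(r"foo\00e9bar"))        # fooébar
print(uescape("foo!00e9bar", "!"))    # fooébar
```

Because the decoder takes an ordinary string value, it can be applied to the result of any expression, which is argument (1) above; nothing about literal parsing has to change.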

The solution with \u in extended literals is probably workable too.
I'm slightly worried about the possibility of issues with code that
thinks it knows what an E-literal means but doesn't really. In
particular something might think it knows that "\u" just means "u",
and proceed to strip the backslash. I don't see a path for that to
become a security hole though, only a garden-variety bug. So I could
live with that one on the grounds of being easier to use (which it
would be, because of less typing compared to uescape()).
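
The kind of mismatch I have in mind can be illustrated with a small Python sketch. Both functions are hypothetical: the first stands in for existing client code that handles a few known escapes and treats any other "\x" as plain "x", the second for a decoder that also understands \uXXXX. Neither is meant to reproduce the backend's actual lexer.

```python
import re

# Escapes both hypothetical decoders agree on.
KNOWN = {"n": "\n", "t": "\t", "\\": "\\", "'": "'"}


def strip_unknown_escapes(body: str) -> str:
    """Hypothetical legacy tool: any unrecognized \\x becomes plain x."""
    return re.sub(
        r"\\(.)",
        lambda m: KNOWN.get(m.group(1), m.group(1)),
        body,
        flags=re.DOTALL,
    )


def decode_with_unicode(body: str) -> str:
    """Hypothetical decoder that additionally understands \\uXXXX."""
    def repl(m: re.Match) -> str:
        if m.group(1) is not None:
            # \uXXXX: four hex digits naming a code point.
            return chr(int(m.group(1), 16))
        return KNOWN.get(m.group(2), m.group(2))

    return re.sub(r"\\(?:u([0-9A-Fa-f]{4})|(.))", repl, body, flags=re.DOTALL)


literal_body = r"caf\u00e9"
print(strip_unknown_escapes(literal_body))  # cafu00e9
print(decode_with_unicode(literal_body))    # café
```

The two disagree only on the new escape, and the old tool's result is wrong text rather than anything that changes where the literal ends, which is why this looks like a garden-variety bug and not a security hole.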

regards, tom lane
