Re: Why format() adds double quote?

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Tatsuo Ishii" <ishii(at)postgresql(dot)org>
Cc: pavel(dot)stehule(at)gmail(dot)com,listas(at)guedesoft(dot)net,robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why format() adds double quote?
Date: 2016-01-27 15:37:14
Message-ID: e3788f38-83e5-4036-9fd7-faa6ea32b774@mm
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tatsuo Ishii wrote:

> 2) What does the SQL standard say? Do they say that non ASCII white
> spaces should be treated as ASCII white spaces?

I've used white space in the example, but I'm concerned about
punctuation too.

unicode.org has this helpful paper:
http://www.unicode.org/L2/L2000/00260-sql.pdf
which studies Unicode in SQL-99 identifiers.

The relevant BNF they extracted from the standard looks like this:

identifier body> ::=
<identifier start>
[ { <underscore> | <identifier part> }... ]

<identifier start> ::=
<initial alphabetic character>
| <ideographic character>

<identifier part> ::=
<alphabetic character>
| <ideographic character>
| <decimal digit character>
| <identifier combining character>
| <underscore>
| <alternate underscore>
| <extender character>
| <identifier ignorable character>
| <connector character>

<delimited identifier> ::=
<double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::= <delimited identifier part>...

<delimited identifier part> ::=
<nondoublequote character>
| <doublequote symbol>

========

The current version of quote_ident() plays it safe by implementing
the rule that, as soon it encounters a character outside
of US-ASCII, it surrounds the identifier with double quotes, no matter
to which category or block this character belongs.
So its output is guaranteed to be compatible with the above grammar.

The change in the patch is that multibyte characters just don't imply
quoting.

But according to the points 1 and 2 of the paper, the first character
must have the Unicode alphabetic property, and it must not
have the Unicode combining property.

I'm mostly ignorant in Unicode so I'm not sure of the precise
implications of having such Unicode properties, but still my
understanding is that the new quote_ident() ignores these rules,
so in this sense it could produce outputs that wouldn't be
compatible with SQL-99.

Also, here's what we say in the manual about non quoted identifiers:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html

"SQL identifiers and key words must begin with a letter (a-z, but also
letters with diacritical marks and non-Latin letters) or an underscore
(_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($)"

So it explicitly allows letters in general (and also seems less
strict than SQL-99 about underscore), but it makes no promise about
Unicode punctuation or spaces, for instance, even though in practice
the parser seems to accept them just fine.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2016-01-27 16:00:16 Re: proposal: PL/Pythonu - function ereport
Previous Message Dilip Kumar 2016-01-27 15:27:00 Re: Patch: fix lock contention for HASHHDR.mutex