Re: Why format() adds double quote?

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: daniel(at)manitou-mail(dot)org
Cc: pavel(dot)stehule(at)gmail(dot)com, listas(at)guedesoft(dot)net, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why format() adds double quote?
Date: 2016-01-28 00:00:29
Message-ID: 20160128.090029.781286852790195741.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> I've used white space in the example, but I'm concerned about
> punctuation too.
>
> unicode.org has this helpful paper:
> http://www.unicode.org/L2/L2000/00260-sql.pdf
> which studies Unicode in SQL-99 identifiers.
>
> The relevant BNF they extracted from the standard looks like this:
>
> <identifier body> ::=
> <identifier start>
> [ { <underscore> | <identifier part> }... ]
>
> <identifier start> ::=
> <initial alphabetic character>
> | <ideographic character>
>
> <identifier part> ::=
> <alphabetic character>
> | <ideographic character>
> | <decimal digit character>
> | <identifier combining character>
> | <underscore>
> | <alternate underscore>
> | <extender character>
> | <identifier ignorable character>
> | <connector character>
>
> <delimited identifier> ::=
> <double quote> <delimited identifier body> <double quote>
>
> <delimited identifier body> ::= <delimited identifier part>...
>
> <delimited identifier part> ::=
> <nondoublequote character>
> | <doublequote symbol>
>
> ========
>
> The current version of quote_ident() plays it safe by implementing
> the rule that, as soon as it encounters a character outside
> of US-ASCII, it surrounds the identifier with double quotes, no matter
> to which category or block this character belongs.
> So its output is guaranteed to be compatible with the above grammar.
>
> The change in the patch is that multibyte characters just don't imply
> quoting.
>
> But according to the points 1 and 2 of the paper, the first character
> must have the Unicode alphabetic property, and it must not
> have the Unicode combining property.

Good point.
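
To make the first-character rule concrete, here is a rough sketch in Python (not PostgreSQL code). It assumes that a general category starting with "L" is an acceptable stand-in for the Unicode alphabetic property, which is only an approximation; the function names first_char_ok and delimited are hypothetical, chosen for illustration.

```python
import unicodedata


def first_char_ok(ch: str) -> bool:
    """Approximate the paper's points 1 and 2 for an identifier's first character."""
    # Assumption: general category L* (Lu, Ll, Lt, Lm, Lo) approximates the
    # Unicode "alphabetic" property; the real property covers a bit more.
    is_alphabetic = unicodedata.category(ch).startswith("L")
    # Combining marks have a Mark category (Mn, Mc, Me) or a nonzero
    # canonical combining class.
    is_combining = (
        unicodedata.category(ch).startswith("M")
        or unicodedata.combining(ch) != 0
    )
    return is_alphabetic and not is_combining


def delimited(ident: str) -> str:
    """Build a <delimited identifier>: wrap in double quotes, doubling any inside."""
    return '"' + ident.replace('"', '""') + '"'


if __name__ == "__main__":
    for sample in ["\u00e9t\u00e9", "\u0301abc", "\u3000abc"]:
        verdict = "ok unquoted" if first_char_ok(sample[0]) else "needs quotes"
        print(repr(sample), verdict, delimited(sample))
```

Under this approximation an identifier starting with a combining accent (U+0301) or an ideographic space (U+3000) would have to be quoted, while one starting with "é" would not. Note that the sketch follows the SQL-99 BNF quoted above, so it also rejects a leading underscore, which PostgreSQL accepts.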

> I'm mostly ignorant in Unicode so I'm not sure of the precise
> implications of having such Unicode properties, but still my
> understanding is that the new quote_ident() ignores these rules,
> so in this sense it could produce outputs that wouldn't be
> compatible with SQL-99.
>
> Also, here's what we say in the manual about non quoted identifiers:
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html
>
> "SQL identifiers and key words must begin with a letter (a-z, but also
> letters with diacritical marks and non-Latin letters) or an underscore
> (_). Subsequent characters in an identifier or key word can be
> letters, underscores, digits (0-9), or dollar signs ($)"
>
> So it explicitly allows letters in general (and also seems less
> strict than SQL-99 about underscore), but it makes no promise about
> Unicode punctuation or spaces, for instance, even though in practice
> the parser seems to accept them just fine.

You could extend your point arbitrarily, not only to Unicode
punctuation or spaces. There are a number of characters in Unicode
that look like "-", for example. Do we want to treat them like ASCII "-"?
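
To give an idea of how many such characters there are, here is a small Python sketch that lists the code points in the Unicode "Dash_Punctuation" general category (Pd). Using Pd as the definition of "looks like a dash" is my assumption for illustration; the helper name dash_characters is hypothetical, and the exact list depends on the Unicode version shipped with the Python build.

```python
import sys
import unicodedata


def dash_characters() -> list[str]:
    """Return every code point whose general category is Pd (dash punctuation)."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]


if __name__ == "__main__":
    for ch in dash_characters():
        # Print the code point and its Unicode name for each dash-like character.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

ASCII "-" (U+002D) is in this list together with HYPHEN (U+2010), EN DASH (U+2013) and others. MINUS SIGN (U+2212) is not, because it belongs to the math symbol category (Sm), so a category test alone would not catch every look-alike.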

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
