Re: BUG #15273: Lexer bug with UESCAPE

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ladayaroslav(at)yandex(dot)ru, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15273: Lexer bug with UESCAPE
Date: 2018-07-11 12:03:41
Message-ID: 87bmbekq90.fsf@news-spur.riddles.org.uk
Lists: pgsql-bugs

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Also, I'm going to push back on the claim that allowing comments
Tom> there is required by the SQL spec. The relevant rules in SQL:2011
Tom> are

Tom> <Unicode character string literal> ::=
Tom>   [ <introducer> <character set specification> ]
Tom>   U <ampersand> <quote> [ <Unicode representation>... ] <quote>
Tom>   [ { <separator> <quote> [ <Unicode representation>... ] <quote> }... ]
Tom>   <Unicode escape specifier>

Tom> <Unicode escape specifier> ::=
Tom>   [ UESCAPE <quote> <Unicode escape character> <quote> ]

Tom> I do not see any principled way of arguing that these rules
Tom> require comments to be allowed adjacent to UESCAPE without also
Tom> claiming that they must be allowed between, say, the initial 'U'
Tom> and the ampersand.

These are the rules that (as far as I can see) apply to that case:

5.2 <token> and <separator>

<separator> ::=
{ <comment> | <white space> }...

7) Any <token> may be followed by a <separator>.

5.3 <literal>

11) In a <Unicode character string literal>, there shall be no
<separator> between the "U" and the <ampersand> nor between the
<ampersand> and the <quote>.

Tom> The only place these rules allow a <separator> is between segments
Tom> of a multiline literal. It looks to me like an extension that we
Tom> even allow whitespace around UESCAPE.

I think that use of <separator> only indicates that a <separator> is
_required_ there, rather than optional as it usually is after tokens,
and that the special rule about requiring newlines also applies only to
that specific use of <separator>.

If the whole <Unicode character string literal> were regarded as a
single token, so that rule 5.2.7 above did not apply around the
UESCAPE, then there would be no reason to write rule 5.3.11 forbidding
separators within the U&' part.

(In the case of X'...', there's rule 5.2.5, which as I see it would
prevent a space after the X, but that rule explicitly does not apply to
the U& cases.)
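
As a rough illustration of this reading (not of how any real lexer is
written), the Python sketch below accepts an optional <separator>,
comments included, on either side of UESCAPE, but nothing between the
"U", the ampersand and the opening quote. The names SEPARATOR,
UNICODE_LITERAL and accepts are made up for the example, and several
simplifications are assumptions: bracketed comments are not nested,
multi-segment literals are left out, and any single non-quote character
is taken as the escape character.

```python
import re

# <separator> ::= { <comment> | <white space> }...
# Assumption: bracketed comments are treated as non-nested for brevity.
SEPARATOR = r"(?:\s+|--[^\n]*|/\*[\s\S]*?\*/)+"

UNICODE_LITERAL = re.compile(
    # "U", "&" and the quote must be adjacent (5.3 rule 11).
    r"U&'(?:[^']|'')*'"
    # Optional separator around UESCAPE (5.2 rule 7, under this reading).
    rf"(?:(?:{SEPARATOR})?UESCAPE(?:{SEPARATOR})?'[^']')?",
    re.IGNORECASE,
)


def accepts(text: str) -> bool:
    """True if text is a single-segment literal under this reading."""
    return UNICODE_LITERAL.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in (
        "U&'d!0061t!+000061' UESCAPE '!'",
        "U&'d!0061t!+000061' /* comment */ UESCAPE '!'",
        "U &'data'",
    ):
        print(repr(sample), accepts(sample))
```

Only the third sample is rejected here, because the space falls inside
the U&' prefix.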

As a related issue, we don't allow comments within the <separator> that
splits a multiline literal, even though the spec certainly allows those
(arguably, since the spec defines that comments are equivalent to
newlines, "select 'foo' /**/ 'bar';" should be legal too).
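
The two readings of that <separator> can be put side by side in a small
Python sketch. It only models the text between two quoted segments; the
names PIECE, split_separator and continues_literal are hypothetical, and
bracketed comments are assumed not to nest. The flag
comment_counts_as_newline switches between "a newline is required" and
"a comment is as good as a newline".

```python
import re

# One piece of a <separator>: white space, a simple comment, or a
# bracketed comment. The newline that ends a simple comment is left to
# be matched as white space by the next piece.
PIECE = re.compile(r"\s+|--[^\n]*|/\*[\s\S]*?\*/")


def split_separator(sep):
    """Return the pieces of sep, or None if it is not a <separator>."""
    pieces, pos = [], 0
    while pos < len(sep):
        m = PIECE.match(sep, pos)
        if m is None:
            return None
        pieces.append(m.group())
        pos = m.end()
    return pieces or None


def continues_literal(sep, comment_counts_as_newline):
    """True if sep may join two segments of one string literal."""
    pieces = split_separator(sep)
    if pieces is None:
        return False
    for piece in pieces:
        is_comment = piece.startswith(("--", "/*"))
        if is_comment and comment_counts_as_newline:
            return True
        if not is_comment and "\n" in piece:
            return True
    return False


if __name__ == "__main__":
    # The separator from "select 'foo' /**/ 'bar';" under both readings.
    print(continues_literal(" /**/ ", comment_counts_as_newline=False))
    print(continues_literal(" /**/ ", comment_counts_as_newline=True))
```

Under the first reading the /**/ example is not a continuation; under
the second it is, which is the case argued for above.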

I've put up a summary of all these at
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#Lexing_of_string_literals_and_comments

(under the assumption that the whole issue is filed under WONTFIX at
least for the time being)

--
Andrew (irc:RhodiumToad)
