Re: Inaccurate documentation about identifiers

From: raf <raf(at)raf(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Inaccurate documentation about identifiers
Date: 2022-11-17 22:47:32
Message-ID: Y3a6BMoEzbcZ0rEy@raf.org
Lists: pgsql-bugs

On Thu, Nov 17, 2022 at 03:01:10PM -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > On Wed, 2022-11-16 at 08:36 -0500, Brennan Vincent wrote:
> >> However, it seems that all non-ASCII characters are considered
> >> "letters"
>
> > You're correct: it seems to allow any byte with the high bit set;
> > including, for example, a zero-width space.
>
> Yes, see scan.l:
>
> ident_start [A-Za-z\200-\377_]
> ident_cont [A-Za-z\200-\377_0-9\$]
>
> identifier {ident_start}{ident_cont}*
>
> > I don't think we want to change the documentation here, because that
> > would amount to a promise that we support such identifiers forever.
> > I also don't think we want to change the code, because it opens up
> > several problems and I'm not sure it's worth trying to solve them.
>
> Right. IIRC, the SQL spec would have us allow only things that actually
> are letters per Unicode or other relevant spec, but (1) that's rather
> encoding-dependent and (2) the hit to parsing speed would likely be
> non-negligible. Still, we might do it someday if someone can find
> a way around those concerns. (Accepting whitespace, in particular,
> is Not Great.) I think benign neglect in the docs is the best path.
>
> regards, tom lane

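For anyone following along, here is a rough Python sketch of the byte-level rules quoted above. It is a translation into Python's `re` module, not the actual flex scanner, and the helper name `is_identifier_bytes` is made up for illustration. It also uses a whole-string match, whereas a real lexer would take the longest matching prefix.

```python
import re

# Byte-level translation of the quoted scan.l character classes.
# \200-\377 in octal is \x80-\xff in hex: any byte with the high bit set.
IDENT_START = rb"[A-Za-z\x80-\xff_]"
IDENT_CONT = rb"[A-Za-z\x80-\xff_0-9$]"
IDENTIFIER = re.compile(IDENT_START + IDENT_CONT + rb"*")


def is_identifier_bytes(text: str, encoding: str = "utf-8") -> bool:
    """Return True if the encoded text matches the byte-level identifier rule.

    Hypothetical helper for illustration only; it never decodes the bytes,
    so it cannot tell a letter from any other non-ASCII character.
    """
    return IDENTIFIER.fullmatch(text.encode(encoding)) is not None


if __name__ == "__main__":
    for candidate in ["caf\u00e9", "\u200b", "a$1", "1abc", "a b"]:
        print(repr(candidate), is_identifier_bytes(candidate))
```

Since every byte of a non-ASCII UTF-8 sequence has the high bit set, a zero-width space (U+200B) passes this check just like an accented letter does.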
I suspect a lot of programming languages only treat ASCII characters
as operators and whitespace.

I have a domain-specific micro-language that explicitly treats all
bytes with the high bit set as "letters" when parsing the names of
things, as a cheap way to "support" ASCII-compatible encodings like
UTF-8 and ISO-8859-* (but it's useless for UTF-16, GB 18030, Big5, ...,
where bytes in the ASCII range can be part of a multi-byte character).
The only way to do it right would be to decode everything. But then
you'd probably lose the ability to include emojis in identifiers. I
wonder if anyone's doing that in PostgreSQL. :-)
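To make "decode everything" concrete, here is a minimal Python sketch, assuming "real letter" means a character whose Unicode general category starts with "L". That is my approximation, not the SQL spec's definition, and the function names are made up for illustration. It would also reject combining marks, so decomposed accented letters fail.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Assumption: a 'real letter' is any character in a Unicode L* category."""
    return unicodedata.category(ch).startswith("L")


def is_identifier_decoded(raw: bytes, encoding: str = "utf-8") -> bool:
    """Decode first, then apply a letter-based identifier rule.

    Hypothetical helper: start is a letter or '_'; continuation also
    allows ASCII digits and '$'. Undecodable input is rejected.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    if not text:
        return False

    def start_ok(ch: str) -> bool:
        return ch == "_" or is_letter(ch)

    def cont_ok(ch: str) -> bool:
        return start_ok(ch) or ch in "0123456789$"

    return start_ok(text[0]) and all(cont_ok(ch) for ch in text[1:])


if __name__ == "__main__":
    for candidate in ["caf\u00e9", "\u200b", "\U0001F600", "a1$"]:
        print(repr(candidate), is_identifier_decoded(candidate.encode("utf-8")))
```

Under this assumption the zero-width space (category Cf) and the emoji (category So) are both rejected, which is the trade-off mentioned above, and the result now depends on knowing the encoding.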

Does the SQL spec require accepting *only* real letters as letters,
or does it require accepting *at least* real letters as letters? :-)
Just a bit of wishful thinking.

cheers,
raf
